We consider the problem of locating a jump discontinuity (change-point)
in a smooth parametric regression model with a bounded covariate. It is
assumed that one can sample the covariate at different values and measure
the corresponding responses. Budget constraints dictate that a total of n
such measurements can be obtained. A multistage adaptive procedure is
proposed, where at each stage an estimate of the change-point is obtained
and new points are sampled from its appropriately chosen neighborhood. It
is shown that such procedures accelerate the rate of convergence of the
least squares estimate of the change-point. Further, the asymptotic
distribution of the estimate is derived using empirical processes
techniques. The latter result provides guidelines on how to choose the
tuning parameters of the multistage procedure in practice. The improved
efficiency of the procedure is demonstrated using real and synthetic data.
This problem is primarily motivated by applications in engineering systems.

1. Introduction. The problem of estimating the location of a jump
discontinuity (change-point) in an otherwise smooth curve has been extensively
studied in the nonparametric regression and survival analysis literature [see,
e.g., Dempfle and Stute (2002), Gijbels, Hall and Kneip (1999), Hall and
Molchanov (2003), Kosorok and Song (2007), Koul and Qian (2002), Loader
(1996), Müller (1992), Müller and Song (1997), Pons (2003), Ritov (1990)
and references therein]. In the classical setting, measurements on all n
covariate-response pairs are available in advance, and the main issue is to
estimate as accurately as possible the location of the change-point. However,
there are applications where it is possible to sample the response at any
covariate value of the experimenter's choice. The only hard constraint is that
the total budget of measurements to be obtained is fixed a priori.

Consider, for instance, the following example from systems engineering.
There is a stochastic flow of jobs/customers of various types arriving to the
system with random service requests. Jobs waiting to be served are placed
in queues of infinite capacity. The system's resources are allocated to the
various job classes (queues) according to some service policy. This system
serves as a canonical queueing model for many applications, including
network switches, flexible manufacturing systems, wireless communications, etc.
[Hung and Michailidis (2008)]. A quantity of great interest to the system's
operator is the average delay of the customers, which is a key performance
metric of the quality of service offered by the system.

The average delay of the customers in a two-class system as a function
of its loading, for a resource allocation policy introduced and discussed in
Hung and Michailidis (2008), is shown in Figure 1. Specifically, the system
was simulated under 134 loading settings and fed by input/service request
processes obtained from real network traces, and the average delay of
500,000 customers was recorded. It can be seen that for loading around 0.8
there is a marked discontinuity in the response, which indicates that under the
specified resource-allocation policy the service provided to the customers
deteriorates. It is of interest to locate the "threshold" where such a change
in the quality of service occurs. It should be pointed out that this threshold
would occur at different system loadings for different allocation policies.

A few comments on the setting implied by this example are in order. First,
the experimenter can select covariate values (in this case, the system's
loading) and subsequently obtain their corresponding sample responses. Second,
the sampled responses are expensive to obtain; for example, the average delay
is obtained by running a fairly large-scale discrete event simulation of the
system under consideration, involving half a million customers. For systems
comprising a large number of customer classes, more computationally
intensive simulations that can last days must be undertaken. Third, in many
situations there is an a priori fixed budget of resources; for this example it
may correspond to CPU time, in other engineering applications to emulation
time, while in other scientific contexts to real money.

Given the potentially limited budget of points that can be sampled and
the lack of a priori knowledge about the location of the change-point, the
following strategy looks promising. A certain portion of the budget is used to
obtain an initial estimate of the change-point based on a least squares
criterion. Subsequently, a neighborhood around this initial estimate is
specified and the remaining points are sampled from it, together with their
responses, yielding a new estimate of the change-point.
Intuition suggests that if the first-stage estimate is fairly accurate, the more
intensive sampling in its neighborhood ought to produce a more accurate
estimate than the one that would have been obtained by laying out the
entire budget of points in a uniform fashion. Obviously, the procedure with its
"zoom-in" characteristics can be extended beyond two stages.

The goal of this paper is to formally introduce such multistage adaptive
procedures for change-point estimation and examine their properties.
In particular, the following important issues are studied and resolved: (i) the
selection of the size of the neighborhoods, (ii) the rate of convergence of the
multistage least squares estimate, together with its asymptotic distribution,
and (iii) allocation of the available budget at each stage.

The proposed procedure should be contrasted with the well-studied
sequential techniques for change-point detection, since the underlying setting
exhibits marked differences. In its simplest form, the sequential change-point
detection problem can be formulated as follows: there is a process that
generates a sequence of independent observations $X_1, X_2, \ldots$ from some
distribution $F_0$. At some unknown point in time $\tau$, the distribution changes
and hence observations $X_\tau, X_{\tau+1}, \ldots$ are generated from $F_1$. The objective
is to raise an alarm as soon as the data-generating mechanism switches to
a new distribution. This problem originally arose in statistical quality
control and over the years has found important applications in other fields.




Fig. 1. Average delay as a function of system loading for a two-class parallel processing
system.

Since this is a canonical problem in sequential analysis, many detection
procedures have been proposed in the literature over the years in discrete and
continuous time, under various assumptions on the distribution of $\tau$ and
the data-generating mechanism. The literature on this subject is truly
enormous; a comprehensive treatment of the problem can be found in the book
by Basseville and Nikiforov (1993), while some recent developments and new
challenges are discussed in the review paper by Lai (2001). An important
difference in our setting is the control that the experimenter exercises over
the data generation process and also the absence of physical time, a crucial
element in the sequential change-point problem.

The remainder of the paper is organized as follows. In Section 2, the
one-stage procedure is briefly reviewed. In Section 3, the proposed procedure
based on adaptive sampling is introduced and the main results regarding
the rate of convergence of the new estimator of the change-point and its
asymptotic distribution are established; in addition, various generalizations
are discussed. In Section 4, the performance of the proposed estimator in
finite samples is studied through an extensive simulation and practical
guidelines are discussed for its various tuning parameters. In Section 5, various
techniques for constructing confidence intervals for the change-point, based
on the two-stage procedure, are discussed. Finally, the majority of the
technical details are provided in the Appendix.

2. The classical problem. In this study, we focus on parametric models
for the regression function of the type

$Y_i = \mu(X_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n,$

where

(1) $\mu(x) = \psi_l(\beta_l, x)\,1(x \le d^0) + \psi_u(\beta_u, x)\,1(x > d^0)$

with $\psi_l(\beta_l, x)$ and $\psi_u(\beta_u, x)$ both (at least) twice continuously
differentiable in $\beta$ and infinitely differentiable in $x$, and
$\psi_l(\beta_l, d^0) \ne \psi_u(\beta_u, d^0)$, so that $d^0$ is the unique point of
discontinuity (the change-point) of the regression function.

The $\varepsilon_i$'s are assumed to be i.i.d. symmetric mean 0 errors with common
(unknown) error variance $\sigma^2$ and are independent of the $X_i$'s, which are
i.i.d. and are distributed on [0, 1] according to some common density $f_X(\cdot)$.
The simplest possible parametric candidate for $\mu(x)$, which we will focus on
largely to illustrate the key ideas in the paper, is the simple step function
$\mu(x) = \alpha_0 1(x \le d^0) + \beta_0 1(x > d^0)$.

Estimating $d^0$ based on the above data is referred to as the "classical
problem." A standard way to estimate the parameters $(\beta_l, \beta_u, d^0)$ is to solve a
least-squares problem. We start by introducing some necessary notation.
Let $\mathbb{P}_n$ denote the empirical distribution of the data vector $\{X_i, Y_i\}_{i=1}^{n}$ and
$P$ the true distribution of $(X_1, Y_1)$. For a function $f$ defined on the space
$[0,1] \times \mathbb{R}$ (in which the vector $(X_1, Y_1)$ assumes values) and a measure $Q$
defined on the Borel $\sigma$-field on $[0,1] \times \mathbb{R}$, we denote $\int f\,dQ$ as $Qf$. We now
turn our attention to the least-squares problem.

The objective is to minimize
$\mathbb{P}_n[(y - \psi_l(\beta_l, x))^2 1(x \le d) + (y - \psi_u(\beta_u, x))^2 1(x > d)]$
over all $(\beta_l, \beta_u, d)$, with $0 \le d \le 1$. Let $(\hat\beta_{l,n}, \hat\beta_{u,n}, \hat d_n)$ denote a vector
of minimizers. Note that we refer to "a vector" of minimizers, since there
will be, in general, multiple triples that minimize the criterion function.
The asymptotic properties of such a vector can be studied by invoking either
the methods of Pons (2003) or those (in Chapter 14) of Kosorok (2008) and
Kosorok and Song (2007). We do not provide the details, but state the results
that are essential to the multistage learning procedures that we formulate
in the next section. We clarify next the meaning of a minimizer of a
right-continuous, real-valued function with left limits (say $f$) defined on an
interval $I$. Specifically, any point $z \in I$ that satisfies
$f(z) \wedge f(z-) = \min_{w \in I} f(w)$ is defined to be a minimizer of $f$. Also, in
order to discuss the results for the classical procedure and those for the
proposed multistage procedures, we need to define a family of compound Poisson
processes that arise in the description of the asymptotic properties of the
estimators of the change-point.
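
For the step-function model, the least-squares criterion can be minimized exactly by scanning the sorted covariates. The Python sketch below is illustrative only: the function name `fit_step_changepoint`, the use of NumPy, and the restriction to splits with at least one point on each side are assumptions made here, not part of the paper. It returns the fitted levels together with the smallest and largest minimizers of the criterion in $d$, in the sense just defined.

```python
import numpy as np

def fit_step_changepoint(x, y):
    """Least-squares fit of y = a*1(x <= d) + b*1(x > d) + noise.

    Returns (a_hat, b_hat, d_lower, d_upper). Every d in [d_lower, d_upper)
    attains the minimal residual sum of squares, and d_upper is a minimizer
    in the left-limit sense used in the text.
    Assumption: at least one observation is kept on each side of the split.
    """
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    n = xs.size
    if n < 2:
        raise ValueError("need at least two observations")
    csum = np.cumsum(ys)
    csq = np.cumsum(ys ** 2)
    k = np.arange(1, n)  # k = number of points with x <= d
    # Residual sums of squares about the left and right sample means.
    left_sse = csq[k - 1] - csum[k - 1] ** 2 / k
    right_sum = csum[-1] - csum[k - 1]
    right_sse = (csq[-1] - csq[k - 1]) - right_sum ** 2 / (n - k)
    best = int(np.argmin(left_sse + right_sse))
    kb = int(k[best])
    a_hat = csum[kb - 1] / kb
    b_hat = right_sum[best] / (n - kb)
    return a_hat, b_hat, xs[kb - 1], xs[kb]
```

Because the criterion is piecewise constant in $d$ between consecutive order statistics, only the $n - 1$ candidate splits need to be examined, and cumulative sums make the whole scan linear after sorting.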

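A family of compound Poisson processes. For a positive constant $\Lambda$, let
$\nu^+(\cdot)$ be a Poisson process on $[0,\infty)$ with right-continuous sample paths,
with $\nu^+(s) \sim \mathrm{Poi}(\Lambda s)$ for $s > 0$. Let $\tilde\nu^+(\cdot)$ be another independent Poisson
process on $[0,\infty)$ with left-continuous sample paths, with $\tilde\nu^+(s) \sim \mathrm{Poi}(\Lambda s)$,
and define a (right-continuous) Poisson process on $(-\infty, 0]$ by
$\{\nu^-(s) = \tilde\nu^+(-s) : s \in (-\infty, 0]\}$. Let $\{\eta_i\}_{i=1}^{\infty}$ and $\{\eta_{-i}\}_{i=1}^{\infty}$ be two
independent sequences of i.i.d. random variables where each $\eta_j$ ($j$ assumes
both positive and negative values) is distributed like $\eta$, $\eta$ being a mean 0
random variable with finite variance $\rho^2$. Given a positive constant $A$, define
families of random variables $\{V_i^+\}_{i=1}^{\infty}$ and $\{V_i^-\}_{i=1}^{\infty}$ where, for each $i \ge 1$,
$V_i^+ = A/2 + \eta_i$ and $V_i^- = -A/2 + \eta_{-i}$. Set $V_0^+ = V_0^- = 0$. Next, define
compound Poisson processes $\mathbb{M}_1$ and $\mathbb{M}_2$ on $(-\infty, \infty)$ as follows:
$\mathbb{M}_1(s) = (\sum_{0 \le i \le \nu^+(s)} V_i^+)\,1(s \ge 0)$ and
$\mathbb{M}_2(s) = (\sum_{0 \le i \le \nu^-(s)} V_i^-)\,1(s \le 0)$. Finally, define the two-sided
compound Poisson process $\mathbb{M}_{A,\eta,\Lambda}(s) = \mathbb{M}_1(s) - \mathbb{M}_2(s)$. It is not difficult to see
that $\mathbb{M}$, almost surely, has a minimizer (in which case it has multiple
minimizers, since the sample paths are piecewise constant). Let $d_l(A, \eta, \Lambda)$ denote
the smallest minimizer of $\mathbb{M}_{A,\eta,\Lambda}$ and $d_u(A, \eta, \Lambda)$ its largest one, which are
almost surely well defined. Then, the following relation holds:

(2) $(d_l(A, \eta, \Lambda), d_u(A, \eta, \Lambda)) \stackrel{d}{=} \left( d_l\left(\frac{A}{\rho}, \frac{\eta}{\rho}, \Lambda\right), d_u\left(\frac{A}{\rho}, \frac{\eta}{\rho}, \Lambda\right) \right)$

(3) $\stackrel{d}{=} \frac{1}{\Lambda} \left( d_l\left(\frac{A}{\rho}, \frac{\eta}{\rho}, 1\right), d_u\left(\frac{A}{\rho}, \frac{\eta}{\rho}, 1\right) \right).$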
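
The minimizers $d_l$ and $d_u$ of the two-sided compound Poisson process are easy to approximate by simulation. The sketch below is a hedged illustration rather than the authors' code: the function name `argmin_two_sided_cpp` and its arguments are hypothetical, the noise $\eta$ is assumed normal (so that, by symmetry, the increments on the left and right sides have the same law), and the process is truncated to a finite window `[-horizon, horizon]`.

```python
import numpy as np

def argmin_two_sided_cpp(A, rate, noise_sd, horizon, rng):
    """Smallest and largest minimizers of a two-sided compound Poisson path.

    The path starts at 0 at s = 0 and, moving away from 0 in either
    direction, jumps by A/2 + eta at the points of a Poisson process of
    intensity `rate`. Assumptions: eta ~ N(0, noise_sd^2) and the path is
    only simulated on [-horizon, horizon].
    """
    def one_side():
        n_jumps = rng.poisson(rate * horizon)
        times = np.sort(rng.uniform(0.0, horizon, size=n_jumps))
        levels = np.cumsum(A / 2.0 + noise_sd * rng.standard_normal(n_jumps))
        return times, levels

    t_pos, lev_pos = one_side()
    t_neg, lev_neg = one_side()
    # Endpoints of the flat stretches, from left to right, and the level
    # of the path on each stretch (the stretch containing 0 has level 0).
    breaks = np.concatenate(([-horizon], -t_neg[::-1], t_pos, [horizon]))
    levels = np.concatenate((lev_neg[::-1], [0.0], lev_pos))
    j = int(np.argmin(levels))
    return breaks[j], breaks[j + 1]

# Usage: Monte Carlo draws of (d_l + d_u) / 2 for A = 1, unit-variance noise.
rng = np.random.default_rng(2024)
draws = np.array([np.mean(argmin_two_sided_cpp(1.0, 1.0, 1.0, 50.0, rng))
                  for _ in range(200)])
```

Empirical quantiles of such draws can stand in for the quantiles of the limit distribution when no closed form is at hand; the truncation window should be wide enough that the minimizer rarely sits at its boundary.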
For the "classical problem," the following proposition holds.

Proposition 1. Consider the model described at the beginning of Section 2.
Suppose that X has a positive bounded density on [0, 1] and that



is $O_p(1)$. Furthermore, the first two components of this vector are
asymptotically independent of the third and

$n(\hat d_n - d^0) \stackrel{d}{\to} d_l(|\mu(d^0+) - \mu(d^0)|, \varepsilon_1, f_X(d^0)).$

Heteroscedastic errors. The proposition can be generalized readily to
cover the case of heteroscedastic errors. A generalization of the classical
model to the heteroscedastic case is as follows: we observe n i.i.d.
observations from the model $Y = \mu(X) + \sigma(X)\varepsilon$ where $\mu(x)$ is as defined in
(1), $\varepsilon$ and $X$ are independent, $\varepsilon$ is a mean 0 error with unit variance and
$\sigma^2(x)$ is a variance function (assumed continuous). As in the homoscedastic
case, an unweighted least-squares procedure is used to estimate the param-



3. The two-stage procedure. We first describe a two-stage procedure for
estimating the (unique) change-point. In what follows, we consider a
regression scenario where the response $Y_x$ generated at covariate level $x$ can
be written as $Y_x = \mu(x) + \varepsilon$, where $\varepsilon$ is an error variable with finite
variance and $\mu$ is the regression function. The errors corresponding to different
covariate levels are i.i.d. We first focus on the simple regression function
$\mu(x) = \alpha_0 1(x \le d^0) + \beta_0 1(x > d^0)$ and discuss generalizations to more
complex parametric models later on. We are allowed to sample at most n
covariate-response pairs and are free to sample a response from any covariate
level that we like.

• Step 1. At stage one, $\lambda n$ covariate values are sampled uniformly from
[0, 1] and responses are obtained. Denote the observed data by $\{X_i, Y_i\}_{i=1}^{n_1}$,
$n_1 = \lambda n$, and the corresponding estimated location of the change-point by
$\hat d_{n_1}$.
• Step 2. Sample the remaining $n_2 = (1 - \lambda)n$ covariate-response pairs
$\{U_i, W_i\}_{i=1}^{n_2}$, where

$W_i = \mu(U_i) + \varepsilon_i, \quad U_i \sim \mathrm{Unif}[a_{n_1}, b_{n_1}]$

and $[a_{n_1}, b_{n_1}] = [\hat d_{n_1} - K n_1^{-\gamma}, \hat d_{n_1} + K n_1^{-\gamma}]$, $0 < \gamma < 1$ and $K$ is some
constant.

Obtain an updated estimate of the change-point based on the $n_2$
covariate-response pairs from stage two, which is denoted by $\hat d_{n_2}$.

We discuss the basic procedure in some more detail. Let $(\hat\alpha_{n_1}, \hat\beta_{n_1}, \hat d_{n_1})$
denote the parameter estimates obtained from stage one. Let $\mathbb{P}_{n_2}$ denote the
empirical measure of the data points $\{U_i, W_i\}_{i=1}^{n_2}$. The updated estimates
are computed by minimizing

$\mathbb{P}_{n_2}[(w - \hat\alpha_{n_1})^2 1(u \le d) + (w - \hat\beta_{n_1})^2 1(u > d)],$

which, as is readily seen, is equivalent to minimizing the process

$\mathbb{M}_{n_2}(d) = \mathbb{P}_{n_2}[\{(w - \hat\alpha_{n_1})^2 - (w - \hat\beta_{n_1})^2\}(1(u \le d) - 1(u \le d^0))].$

The process $\mathbb{M}_{n_2}$ is a piecewise-constant, right-continuous function with left
limits. We let $\hat d_{n_2,l}$ and $\hat d_{n_2,u}$ denote its minimal and maximal minimizers,
respectively. Our goal is to determine the joint limit distribution of normalized
versions of $(\hat d_{n_2,l}, \hat d_{n_2,u})$. This is described in the theorems that follow.
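
The two steps can be sketched in Python for the step-function model. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `two_stage_estimate`, `second_stage_argmins` and `_ls_split` are hypothetical, errors are taken to be normal, the first-stage estimate is taken as the midpoint of the minimizing interval, and the second-stage window is clipped to [0, 1].

```python
import numpy as np

def _ls_split(x, y):
    """Stage-one least squares for a step function: (a_hat, b_hat, d_mid)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = xs.size
    csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
    k = np.arange(1, n)  # number of points in the left piece
    right_sum = csum[-1] - csum[k - 1]
    sse = (csq[k - 1] - csum[k - 1] ** 2 / k
           + (csq[-1] - csq[k - 1]) - right_sum ** 2 / (n - k))
    kb = int(k[np.argmin(sse)])
    a_hat = csum[kb - 1] / kb
    b_hat = (csum[-1] - csum[kb - 1]) / (n - kb)
    return a_hat, b_hat, 0.5 * (xs[kb - 1] + xs[kb])

def second_stage_argmins(u, w, a_hat, b_hat, lo, hi):
    """Smallest and largest minimizers over [lo, hi] of
    sum_i {(w_i - a_hat)^2 - (w_i - b_hat)^2} 1(u_i <= d)."""
    order = np.argsort(u)
    us, ws = u[order], w[order]
    contrast = (ws - a_hat) ** 2 - (ws - b_hat) ** 2
    # path[j] = criterion value when exactly j sampled points lie at or below d
    path = np.concatenate(([0.0], np.cumsum(contrast)))
    j = int(np.argmin(path))
    knots = np.concatenate(([lo], us, [hi]))
    return knots[j], knots[j + 1]

def two_stage_estimate(mu, sigma, n, lam=0.5, gamma=0.5, K=1.0, rng=None):
    """Return (first-stage estimate, second-stage midpoint estimate)."""
    rng = np.random.default_rng() if rng is None else rng
    n1 = int(lam * n)
    n2 = n - n1
    # Step 1: uniform design on [0, 1].
    x = rng.uniform(0.0, 1.0, n1)
    y = mu(x) + sigma * rng.standard_normal(n1)
    a_hat, b_hat, d1 = _ls_split(x, y)
    # Step 2: zoom in on [d1 - K n1^(-gamma), d1 + K n1^(-gamma)], clipped to [0, 1].
    half = K * n1 ** (-gamma)
    lo, hi = max(0.0, d1 - half), min(1.0, d1 + half)
    u = rng.uniform(lo, hi, n2)
    w = mu(u) + sigma * rng.standard_normal(n2)
    d_l, d_u = second_stage_argmins(u, w, a_hat, b_hat, lo, hi)
    return d1, 0.5 * (d_l + d_u)
```

The second-stage criterion reuses the stage-one levels and only re-estimates the change-point, which is why a single cumulative sum over the sorted second-stage sample suffices.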

Theorem 1. Assume that the error variable $\varepsilon$ in the regression model
has a finite moment generating function in a neighborhood of 0. Then, the
random vector $n^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$ is $O_p(1)$.

Remark. The proof of this theorem is fairly technical and particularly
long and is thus deferred to the Appendix. However, a few words regarding
the intuition behind the accelerated rate of convergence are in order. For
simplicity, consider sampling procedures where instead of sampling from a
uniform distribution on the interval of interest, sampling takes place on a
uniform grid on the interval. The interval from which sampling takes place
at the second stage has length $2K n_1^{-\gamma}$. Since the $n_2$ covariate values are
equispaced over this interval, the resolution of the resulting grid at which
responses are measured is $O(n_1^{-\gamma}/n_2) = O(n^{-(1+\gamma)})$ and this determines the
rate of convergence of the two-stage estimator (just as the rate of convergence
in the classical procedure where n covariates are equispaced over [0, 1] is
given by the resolution of the resulting grid in that situation, which is simply
$O(n^{-1})$).
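
The grid heuristic can be made explicit by substituting $n_1 = \lambda n$ and $n_2 = (1 - \lambda)n$. The numerical values in the second display are an illustrative choice made here, not figures taken from the paper.

```latex
\[
\frac{2K n_1^{-\gamma}}{n_2}
  = \frac{2K (\lambda n)^{-\gamma}}{(1-\lambda)\, n}
  = \frac{2K}{\lambda^{\gamma}(1-\lambda)}\; n^{-(1+\gamma)} .
\]
% Illustrative numbers: n = 1000, \lambda = 1/2, \gamma = 1/2, K = 1.
\[
\frac{2}{(1/2)^{1/2}\,(1/2)} \times 1000^{-3/2}
  \approx 5.66 \times 3.16 \times 10^{-5}
  \approx 1.8 \times 10^{-4},
\qquad \text{compared with } n^{-1} = 10^{-3} \text{ for the one-stage grid.}
\]
```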

We next describe the limit distributions of the normalized estimates
considered in Theorem 1.

Theorem 2. Set $C(K, \lambda, \gamma) = (2K)^{-1}(\lambda/(1 - \lambda))^{\gamma}$. The random vector
$n_2^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$ converges in distribution to

$(d_l(|\alpha_0 - \beta_0|, \varepsilon, C(K, \lambda, \gamma)), d_u(|\alpha_0 - \beta_0|, \varepsilon, C(K, \lambda, \gamma))).$

Remark. The asymptotic distributions of the "zoom-in" estimators are
given by the minimizers of a compound Poisson process. The underlying
Poisson process is basically the limiting version of the count process
$\{\nu_n(s) : s \in \mathbb{R}\}$, where $\nu_n(s)$ counts the number of $U_i$'s lying between $d^0$
and $d^0 + s/n_2^{1+\gamma}$. It can be readily checked that marginally, $\nu_n(s)$ converges
in distribution to a Poisson random variable with mean $C(K, \lambda, \gamma)|s|$, using
the Poisson approximation to the Binomial distribution. On the other hand,
the size of the jumps of the compound Poisson process is basically
determined by $|\alpha_0 - \beta_0|/\sigma$, the signal-to-noise ratio in the model.

General parametric models. These results admit ready extensions to the
case where the function $\mu(x)$ is as defined in (1). As in the case of a
piecewise constant $\mu$, $n_1 = \lambda n$ points are initially used to obtain least-squares
estimates of $(\beta_l, \beta_u, d^0)$, which we denote by $(\hat\beta_{l,n_1}, \hat\beta_{u,n_1}, \hat d_{n_1})$. Step 2 of the
two-stage procedure is identical and the updated estimate $\hat d_{n_2}$ is computed
by minimizing the criterion function

$\mathbb{P}_{n_2}[(w - \psi_l(\hat\beta_{l,n_1}, u))^2 1(u \le d) + (w - \psi_u(\hat\beta_{u,n_1}, u))^2 1(u > d)],$

which is equivalent to minimizing

$\mathbb{M}_{n_2}(d) = \mathbb{P}_{n_2}[\{(w - \psi_l(\hat\beta_{l,n_1}, u))^2 - (w - \psi_u(\hat\beta_{u,n_1}, u))^2\}(1(u \le d) - 1(u \le d^0))].$

Letting $\hat d_{n_2,l}$ and $\hat d_{n_2,u}$ denote the smallest and largest argmins of $\mathbb{M}_{n_2}$,
respectively (as in the piecewise-constant function case), we have the following
proposition.

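Proposition 2. The random vector $n_2^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$
converges in distribution to

$(d_l(|\psi_l(\beta_l, d^0) - \psi_u(\beta_u, d^0)|, \varepsilon, C(K, \lambda, \gamma)),$
$\ d_u(|\psi_l(\beta_l, d^0) - \psi_u(\beta_u, d^0)|, \varepsilon, C(K, \lambda, \gamma))).$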

The heteroscedastic case. Similar results continue to hold for a
heteroscedastic regression model. We formulate the heteroscedastic setting as
follows. At any given covariate level $x$, the observed response is
$Y_x = \mu(x) + \sigma(x)\varepsilon$, with $\mu(x)$ as defined in (1), where $\sigma^2(x)$ is a (continuous)
variance function and $\varepsilon$ is a symmetric error variable with unit variance. The
errors corresponding to different covariate values are independent. Using the
same two-stage procedure as described above, the following proposition obtains.

Proposition 3. We have

$n_2^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$
$\ \stackrel{d}{\to} (d_l(|\psi_l(\beta_l, d^0) - \psi_u(\beta_u, d^0)|, \sigma(d^0)\varepsilon, C(K, \lambda, \gamma)),$
$\ \ d_u(|\psi_l(\beta_l, d^0) - \psi_u(\beta_u, d^0)|, \sigma(d^0)\varepsilon, C(K, \lambda, \gamma))).$

Remark. With the choice of a constant variance function, $\sigma^2(x) = \sigma^2$, the
heteroscedastic model reduces to the homoscedastic one. We nevertheless
present results for these two situations separately. We also subsequently
derive our results for the homoscedastic case, the derivations extending almost
trivially to the heteroscedastic case.

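3.1. Proof of Theorem 2. For the proof of this theorem (and the proof of
Lemma 3.2 in the Appendix) we denote the process $\mathbb{M}_{|\alpha_0 - \beta_0|, \varepsilon, C(K, \lambda, \gamma)}$ simply
by $\mathbb{M}$ and its smallest and largest minimizers simply by $(d_l, d_u)$. Our proof of
this theorem will rely on continuous mapping for the arg min functional. For
the sake of concreteness, in what follows, we assume that $\alpha_0 < \beta_0$. Under this
assumption, with probability increasing to 1 as $n$ (and consequently $n_1$) goes
to infinity, $\hat\alpha_{n_1} < \hat\beta_{n_1}$ and $d^0$ belongs to the set
$[\hat d_{n_1} - K n_1^{-\gamma}, \hat d_{n_1} + K n_1^{-\gamma}]$. On this set, $(\hat d_{n_2,l}, \hat d_{n_2,u})$ can be obtained
by minimizing (the equivalent) criterion function

$\mathbb{P}_{n_2}\left[\left(w - \frac{\hat\alpha_{n_1} + \hat\beta_{n_1}}{2}\right)(1(u \le d) - 1(u \le d^0))\right]$

and $d^0$ is characterized as

$d^0 = \arg\min_d P\left[\left(w - \frac{\alpha_0 + \beta_0}{2}\right)(1(u \le d) - 1(u \le d^0))\right],$

where $P$ is the distribution of $(W, U)$. Therefore, in what follows, we take

$\mathbb{M}_{n_2}(d) = \mathbb{P}_{n_2}\left[\left(w - \frac{\hat\alpha_{n_1} + \hat\beta_{n_1}}{2}\right)(1(u \le d) - 1(u \le d^0))\right]$

and $\hat d_{n_2,l}$ and $\hat d_{n_2,u}$ to be the smallest and largest argmins of this stochastic
process. Set $(\xi_{n,l}, \xi_{n,u}) = n_2^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$. Then $(\xi_{n,l}, \xi_{n,u})$ is the
vector of smallest and largest argmins of the stochastic process

$\mathbb{M}_{n_2}(s) = \sum_{i=1}^{n_2} \left(W_i - \frac{\hat\alpha_{n_1} + \hat\beta_{n_1}}{2}\right)\left(1\left(U_i \le d^0 + \frac{s}{n_2^{1+\gamma}}\right) - 1(U_i \le d^0)\right)$
$\ = \mathbb{M}_{n_2}^{+}(s) - \mathbb{M}_{n_2}^{-}(s),$

where $\mathbb{M}_{n_2}^{+}(s) = \mathbb{M}_{n_2}(s)\,1(s \ge 0)$ and $\mathbb{M}_{n_2}^{-}(s) = -\mathbb{M}_{n_2}(s)\,1(s < 0)$.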
We now introduce some notation that is crucial to the subsequent
development. Let $\mathcal{S}$ denote the class of piecewise-constant, right-continuous
functions with left limits (from $\mathbb{R}$ to $\mathbb{R}$) that are continuous at every integer
point, assume the value 0 at 0 and possess finitely many jumps in every
compact interval $[-C, C]$ where $C > 0$ is an integer. Let $\tilde f$ denote the pure
jump process (of jump size 1) corresponding to the function $f$; that is, $\tilde f$ is
the piecewise-constant, right-continuous function with left limits, such that
for any $s > 0$, $\tilde f(s)$ counts the number of jumps of the function $f$ in the
interval $[0, s]$, while for $s < 0$, $\tilde f(s)$ counts the number of jumps in the set
$(s, 0)$.

For any positive integer $C$, let $D[-C, C]$ denote the class of all
right-continuous functions with left limits with domain $[-C, C]$ equipped with the
Skorokhod topology and let $D[-C, C] \times D[-C, C]$ denote the corresponding
product space. Finally, let $\mathcal{D}_C^0$ denote the (metric) subspace of
$D[-C, C] \times D[-C, C]$ that comprises all function pairs of the form
$(f|_{[-C,C]}, \tilde f|_{[-C,C]})$ for $f \in \mathcal{S}$. We have the following lemma that is proved in
the Appendix section of Lan, Banerjee and Michailidis (2007).



Lemma 3.1. Let $\{f_n\}$ and $f_0$ be functions in $\mathcal{S}$, such that for every
positive integer $C$, $(f_n|_{[-C,C]}, \tilde f_n|_{[-C,C]})$ converges to $(f_0|_{[-C,C]}, \tilde f_0|_{[-C,C]})$ in
$\mathcal{D}_C^0$, where $f_0$ satisfies the property that no two flat stretches of $f_0$ have the
same height. Let $l_{n,C}$ and $u_{n,C}$ denote the smallest and the largest minimizers
of $f_n$ on $[-C, C]$, and $l_{0,C}$ and $u_{0,C}$ denote the corresponding functionals for
$f_0$. Then $(l_{n,C}, u_{n,C}) \to (l_{0,C}, u_{0,C})$.

Consider the sequence of stochastic processes $\mathbb{M}_{n_2}(s)$ and let $\mathbb{J}_{n_2}(s)$ denote
the corresponding jump processes. We have

$\mathbb{J}_{n_2}(s) = \mathrm{sign}(s) \sum_{i=1}^{n_2} \left(1\left(U_i \le d^0 + \frac{s}{n_2^{1+\gamma}}\right) - 1(U_i \le d^0)\right)$
$\ = \mathbb{J}_{n_2}^{+}(s) + \mathbb{J}_{n_2}^{-}(s),$

where $\mathbb{J}_{n_2}^{+}(s) = \mathbb{J}_{n_2}(s)\,1(s \ge 0)$ and $\mathbb{J}_{n_2}^{-}(s) = \mathbb{J}_{n_2}(s)\,1(s < 0)$. The jump
process corresponding to $\mathbb{M}(s)$ is denoted by $\mathbb{J}(s)$ and is given by
$\nu^+(s)\,1(s \ge 0) + \nu^-(s)\,1(s \le 0)$. For each $n$, $\{\mathbb{M}_{n_2}(s) : s \in \mathbb{R}\}$ lives in $\mathcal{S}$ with
probability one. Also, with probability one, $\{\mathbb{M}(s) : s \in \mathbb{R}\}$ lives in $\mathcal{S}$. Also, on
a set of probability one (which does not depend on $C$), for every positive
integer $C$, $((\mathbb{M}_{n_2}(s), \mathbb{J}_{n_2}(s)) : s \in [-C, C])$ belongs to $\mathcal{D}_C^0$ and so does
$((\mathbb{M}(s), \mathbb{J}(s)) : s \in [-C, C])$. Let $(\xi_{n,C,l}, \xi_{n,C,u})$ denote the smallest and largest
argmin of $\mathbb{M}_{n_2}$ restricted to $[-C, C]$ and let $(d_{C,l}, d_{C,u})$ denote the
corresponding functionals for $\mathbb{M}$ restricted to $[-C, C]$. We prove in the Appendix:

Lemma 3.2. For every $C > 0$, $((\mathbb{M}_{n_2}(s), \mathbb{J}_{n_2}(s)) : s \in [-C, C])$ converges
in distribution to $((\mathbb{M}(s), \mathbb{J}(s)) : s \in [-C, C])$ in the space $\mathcal{D}_C^0$.

Consider the function $h$ that maps an element (a pair of functions) of $\mathcal{D}_C^0$
to the two-dimensional vector given by the smallest argmin and the largest
argmin of the first component of the element. Using the fact that almost
surely no two flat stretches of $\mathbb{M}$ have the same height, it follows by Lemma
3.1 that the process $((\mathbb{M}(s), \mathbb{J}(s)) : s \in [-C, C])$ belongs, almost surely, to
the continuity set of the function $h$. This, coupled with the distributional
convergence established in Lemma 3.2, leads to the conclusion that

(4) $(\xi_{n,C,l}, \xi_{n,C,u}) \stackrel{d}{\to} (d_{C,l}, d_{C,u}).$

We will show that $(\xi_{n,l}, \xi_{n,u}) \stackrel{d}{\to} (d_l, d_u)$. To this end, we use the following
lemma.



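Lemma 3.3. Suppose that $\{W_{n\varepsilon}\}$, $\{W_n\}$ and $\{W_\varepsilon\}$ are three sets of
random vectors such that:

(i) $\lim_{\varepsilon \to 0} \limsup_{n \to \infty} P[W_{n\varepsilon} \ne W_n] = 0$,

(ii) $\lim_{\varepsilon \to 0} P[W_\varepsilon \ne W] = 0$ and

(iii) for every $\varepsilon > 0$, $W_{n\varepsilon} \stackrel{d}{\to} W_\varepsilon$ as $n \to \infty$.
Then, $W_n \stackrel{d}{\to} W$, as $n \to \infty$.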

Before applying the lemma, we first note the following facts: (a) the
sequence of (smallest and largest) minimizers $(\xi_{n,l}, \xi_{n,u})$ is $O_p(1)$, and (b) the
minimizers $(d_l, d_u)$ are $O_p(1)$. Now, in the above lemma, set $\varepsilon = 1/C$,
$W_{n\varepsilon} = (\xi_{n,C,l}, \xi_{n,C,u})$, $W_\varepsilon = (d_{C,l}, d_{C,u})$, $W_n = (\xi_{n,l}, \xi_{n,u})$ and $W = (d_l, d_u)$.
Condition (iii) is established in (4). From (a) and (b) it follows that conditions (i)
and (ii) of the lemma are satisfied. We conclude that $(\xi_{n,l}, \xi_{n,u}) \stackrel{d}{\to} (d_l, d_u)$.
□



Remark. It is instructive to compare the obtained result on the
convergence of the (nonunique) argmin functional to that considered in Ferger
(2004). Ferger deals with the convergence of the argmax functional under
the Skorokhod topology in Theorems 2 and 3 of his paper. Since the argmax
functional is not continuous under the Skorokhod topology, an exact result
on distributional convergence cannot be achieved. Instead, asymptotic upper
and lower bounds are obtained on the distribution function of the argmax
in terms of the smallest maximizer and the largest maximizer of the limit
process [page 88 of Ferger (2004)]. The result we obtain here is, admittedly,
in a more specialized set-up than the one considered in his paper, but it
is stronger since we are able to show exact distributional convergence of
argmins. This is achieved at the cost of some extra effort: establishing the
joint convergence of the original processes, whose argmins are of interest, and
their jump processes, and subsequently invoking continuous mapping. Under
this stronger mode of convergence, the argmin functional indeed turns out
to be continuous, as Lemma 3.1 shows [the arguments employed are similar
in spirit to those in Section 14.5.1 of Kosorok (2008)]. This result allows
us to construct asymptotic confidence intervals that have exact coverage at
any given level, as opposed to the conservative intervals proposed in Ferger
(2004). That the exact confidence intervals buy us significant precision
over the conservative ones is evident from the reported simulation results
discussed in Section 5.



4. Multistage procedures and strategies for parameter allocation. 



4.1. Multistage procedures. We consider a generalization of the two-stage
procedure to $P$ stages in the setting of the heteroscedastic model with a
general parametric regression function $\mu$. Let $\lambda_1, \lambda_2, \ldots, \lambda_P$ be the
proportions of points used at each stage (where $\lambda_1 + \lambda_2 + \cdots + \lambda_P = 1$) and
let $n_j = \lambda_j n$. Also, fix sequences of numbers $0 < \gamma_{P-1} < \cdots < \gamma_1 < 1$ and
$K_1, K_2, \ldots, K_{P-1}$ (with $K_i > 0$). Having used $n_1$ points to construct the
initial estimate $\hat d_{n_1}$, in the $q$th ($2 \le q \le P$) stage, define the sampling
neighborhood as
$[\hat d_{n_{q-1}} - K_{q-1} n_{q-1}^{-((q-2)+\gamma_{q-1})}, \hat d_{n_{q-1}} + K_{q-1} n_{q-1}^{-((q-2)+\gamma_{q-1})}]$, sample $n_q$
covariate-response pairs $\{W_i, U_i\}_{i=1}^{n_q}$ from this neighborhood, with
$W_i = \mu(U_i) + \varepsilon_i$, and update the estimate of the change-point to $\hat d_{n_q}$. Let
$(\hat d_{n_P,l}, \hat d_{n_P,u})$ denote the smallest and largest estimates at stage $P$. It can be
shown using arguments analogous to those in Theorems 1 and 2 that

$n_P^{(P-1)+\gamma_{P-1}} ((\hat d_{n_P,l} - d^0), (\hat d_{n_P,u} - d^0))$

is $O_p(1)$ and converges in distribution to $(d_l, d_u)$, where $(d_l, d_u)$ is the vector
of the smallest and the largest argmins of the process

$\mathbb{M}_{|\psi_l(\beta_l, d^0) - \psi_u(\beta_u, d^0)|,\ \sigma(d^0)\varepsilon,\ C_P}$

with $C_P = (1/(2K_{P-1}))(\lambda_{P-1}/\lambda_P)^{(P-2)+\gamma_{P-1}}$.

4.2. Strategies for parameter allocation. In this section, we describe
strategies for selecting the tuning parameters $K$, $\gamma$ and $\lambda$ used in the procedure.
We do this in the setting of the simple regression model
$\mu(x) = \alpha_0 1(x \le d^0) + \beta_0 1(x > d^0)$ and homoscedastic normal errors, obvious
analogues holding in more general settings.

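Recall that $(\hat d_{n_q,l}, \hat d_{n_q,u})$ are the minimal and maximal minimizers at step
$q$, $2 \le q \le P$. Set $\hat d_{q,av} = (\hat d_{n_q,l} + \hat d_{n_q,u})/2$. In what follows we use this as our
$q$th stage estimate of the change-point.

We start with the case of $P = 2$ to fix ideas and motivate the general
result. Using notation from Theorem 2 of this paper, we have

$n_2^{1+\gamma_1}(\hat d_{2,av} - d^0) \stackrel{d}{\to} \frac{d_l(|\alpha_0 - \beta_0|, \varepsilon, C(K_1, \lambda_1, \gamma_1)) + d_u(|\alpha_0 - \beta_0|, \varepsilon, C(K_1, \lambda_1, \gamma_1))}{2}.$

It is also not difficult to see that this limit distribution is symmetric about
0.

Henceforth, the notation Argmin will denote the simple average of the
minimal and maximal minimizers of a compound Poisson process. The
quantity $|\alpha_0 - \beta_0|/\sigma$ will be denoted as SNR (signal-to-noise ratio). The higher
the SNR, the more accurately the change-point can be estimated at any
given sample size. By (3), we have

$\mathrm{Argmin}\ \mathbb{M}_{|\alpha_0 - \beta_0|, \varepsilon, C(K_1, \lambda_1, \gamma_1)} \stackrel{d}{=} \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, C(K_1, \lambda_1, \gamma_1)} \stackrel{d}{=} \frac{1}{C(K_1, \lambda_1, \gamma_1)}\, \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1},$

where $Z$ is the standard normal random variable. This is a consequence of
the fact that $\varepsilon/\sigma \sim N(0, 1)$. From Theorem 2 and the above display, we have

$n^{1+\gamma_1}(\hat d_{2,av} - d^0) \stackrel{d}{\to} \frac{1}{\lambda_2^{1+\gamma_1}}\, 2K_1 \left(\frac{\lambda_2}{\lambda_1}\right)^{\gamma_1} \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1} = \frac{2K_1}{\lambda_2 \lambda_1^{\gamma_1}}\, \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1}.$

For a given sample size $n$, the objective is to choose the parameters $K_1$ and
$\gamma_1$, so that the change-point $d^0$ would be contained in the prescribed interval
$[\hat d_{1,av} - K_1 n_1^{-\gamma_1}, \hat d_{1,av} + K_1 n_1^{-\gamma_1}]$ with very high probability. In practice, this
translates to choosing $K_1$ in such a way that $K_1 n_1^{-\gamma_1} \approx C_{\zeta/2}/n_1$, where $C_{\zeta/2}$
is the upper $\zeta/2$th quantile of the distribution of $\mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1}$, which is
symmetric about 0. The value of $\zeta$ should be small, say 0.0005. In other
words, our goal is to "zoom in," but not so excessively that we systematically
start missing $d^0$ in the prescribed sampling interval. With this choice for $K_1$
we then have

(5) $(\hat d_{2,av} - d^0) \stackrel{d}{\approx} \frac{2C_{\zeta/2}}{n_1^{1-\gamma_1} \lambda_2 \lambda_1^{\gamma_1} n^{1+\gamma_1}}\, \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1} = \frac{2C_{\zeta/2}}{n^2 \lambda_2 \lambda_1}\, \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1}.$

It can be seen that the right-hand side is minimized when setting $\lambda_1 = \lambda_2 =
1/2$ (equal allocation of samples in the first and second stages).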
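
The optimality of equal allocation can be checked numerically. The sketch below assumes that the relevant scale factor is $2C/(n^2 \lambda_1 \lambda_2)$ with $\lambda_2 = 1 - \lambda_1$; the function name `two_stage_scale` and the values `n=500` and `c_quantile=3.0` are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def two_stage_scale(lam1, n, c_quantile):
    """Scale factor 2*C / (n^2 * lam1 * lam2) multiplying the Argmin limit,
    with lam2 = 1 - lam1 (all points not used at stage one go to stage two)."""
    lam2 = 1.0 - lam1
    return 2.0 * c_quantile / (n ** 2 * lam1 * lam2)

# Grid search over the first-stage proportion (illustrative inputs).
grid = np.linspace(0.05, 0.95, 91)            # step 0.01
scales = two_stage_scale(grid, n=500, c_quantile=3.0)
best_lam1 = grid[np.argmin(scales)]
print(f"scale is smallest at lambda_1 = {best_lam1:.2f}")
```

The minimizer does not depend on $n$ or on the quantile, since both enter only as a common multiplicative factor; only the product $\lambda_1 \lambda_2$ matters.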

Consider now a one-stage procedure with the covariates sampled from a
density $f_X$, with the estimate of the change-point once again chosen to be
the simple average of the minimal and maximal minimizers; call this $\hat d_{av}$.






Y. LAN, M. BANERJEE AND G. MICHAILIDIS 



In this case, the standard change-point asymptotics in conjunction with (2)
and (3) give

$n(\hat d_{av} - d^0) \stackrel{d}{\to} \frac{1}{f_X(d^0)}\, \mathrm{Argmin}\ \mathbb{M}_{\mathrm{SNR}, Z, 1}.$

This immediately provides an expression for the asymptotic efficiency of the
two-stage procedure with respect to the one-stage (in terms of ratios of
approximate standard deviations) given, for $\lambda_1 = \lambda_2 = 1/2$, by

(6) $\mathrm{ARE}_{2,1}(n) \approx \frac{n}{8 C_{\zeta/2} f_X(d^0)}.$

It is not difficult to see that the same approximate formula for the ARE
holds for some other measures of dispersion, besides the standard deviation.
Let

$\mathrm{ARE}_{2,1,\mathrm{MAD}}(n) = \frac{\mathrm{MAD}(\hat d_{av})}{\mathrm{MAD}(\hat d_{2,av})}$ and $\mathrm{ARE}_{2,1,\mathrm{IQR}}(n) = \frac{\mathrm{IQR}(\hat d_{av})}{\mathrm{IQR}(\hat d_{2,av})},$

where both the one-stage and two-stage estimates are based on samples of size n,
and IQR(X) denotes the interquartile range of the distribution of a random
variable X. Then, following similar steps to those involved in calculating the
ARE based on standard deviations, we conclude that

$\mathrm{ARE}_{2,1,\mathrm{MAD}}(n) \approx \mathrm{ARE}_{2,1,\mathrm{IQR}}(n) \approx \frac{n}{8 C_{\zeta/2} f_X(d^0)}.$

The accuracy of the above approximation is confirmed empirically through a simulation study. The setting involves a change-point model given by

(7)  $Y_i = 0.5\cdot 1(x_i \le 0.5) + 1.5\cdot 1(x_i > 0.5) + \varepsilon_i, \quad x_i \in (0,1).$

The variance $\sigma^2$ was chosen so that the SNR, defined as $(\beta_0 - \alpha_0)/\sigma$, equals 1, 2, 5 and 8, and the sample size varies in increments of 50 from 50 to 1500. The results based on an interval corresponding to $\zeta_1 = 0.0025$ and 5000 replications are shown in Figure 2. Further, for the two-stage procedure the smallest and largest 5 "outliers" were dropped, since they introduced too much variability, especially for the standard deviation based ARE. These "outliers" correspond to the cases where the true parameter was not contained in the zoom-in sampling interval. It can be seen that there is close agreement between the theoretical formula for the ARE and the empirical ARE, for all three performance measures employed.
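
The simulation design described above can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not the authors' implementation: it assumes NumPy, the function names are hypothetical, and the second-stage half-width `half_width` is passed in directly as a stand-in for the quantity $K n_1^{-\gamma}$. It fits the stump model (7) by least squares in the first stage, holds the fitted levels fixed, and re-estimates the change-point from points sampled uniformly in the zoom-in interval, reporting in both stages the average of the minimal and maximal minimizers.

```python
import numpy as np

def first_stage(x, y):
    """Least squares fit of a stump model; returns (alpha_hat, beta_hat, d_hat)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
    k = np.arange(1, n)                      # number of points left of the split
    sse_left = csq[k - 1] - csum[k - 1] ** 2 / k
    sse_right = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
    best = k[np.argmin(sse_left + sse_right)]
    # every d between xs[best-1] and xs[best] gives the same fit: take the midpoint
    d_hat = 0.5 * (xs[best - 1] + xs[best])
    return ys[:best].mean(), ys[best:].mean(), d_hat

def second_stage(x, y, alpha, beta, lo, hi):
    """Change-point estimate with the levels alpha, beta held fixed."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # extra cost of assigning a point to the left level rather than the right one
    diff = (ys - alpha) ** 2 - (ys - beta) ** 2
    crit = np.concatenate(([0.0], np.cumsum(diff)))   # criterion for 0..n points on the left
    best = int(np.argmin(crit))
    grid = np.concatenate(([lo], xs, [hi]))
    return 0.5 * (grid[best] + grid[best + 1])

def two_stage(n, snr, half_width, rng, d0=0.5, alpha0=0.5, beta0=1.5):
    """One replication of the two-stage procedure with equal allocation."""
    sigma = (beta0 - alpha0) / snr

    def respond(x):
        return np.where(x <= d0, alpha0, beta0) + sigma * rng.normal(size=x.shape)

    n1 = n // 2
    x1 = rng.uniform(0.0, 1.0, n1)
    a_hat, b_hat, d1 = first_stage(x1, respond(x1))
    lo, hi = max(0.0, d1 - half_width), min(1.0, d1 + half_width)
    x2 = rng.uniform(lo, hi, n - n1)
    d2 = second_stage(x2, respond(x2), a_hat, b_hat, lo, hi)
    return d1, d2

rng = np.random.default_rng(0)
d1, d2 = two_stage(n=200, snr=5.0, half_width=0.05, rng=rng)
```

Repeating `two_stage` over many replications and comparing the spread of the second-stage estimates with that of a one-stage fit on all n points gives an empirical ARE of the kind plotted in Figure 2; the trimming of "outliers" is not reproduced here.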

Remark. The formula for the ARE in (6) says that the "agnostic" two-stage procedure ("agnostic" since the covariates are sampled uniformly at each stage) will eventually, that is, with increasing n, surpass any one-stage procedure, no matter the amount of background information incorporated about the location of the change-point in the one-stage process. One can think of an "oracle-type" one-stage procedure where the experimenter samples the covariates from a density that peaks in a neighborhood of $d^0$ relative to the uniform density [corresponding to high values of $f_X(d^0)$]. The faster convergence rate of the two-stage procedure relative to this one-stage procedure guarantees that with increasing n, the ARE will always go to infinity. Further, expression (6) provides an approximation to the minimal sample size required for the two-stage procedure to outperform the "classical" one, a result verified through simulations (not shown here).

Fig. 2. Top panels: ARE for standard deviation, IQR and MAD measures for SNR = 1 (left) and 2 (right). Bottom panels: corresponding ARE for SNR = 5 (left) and 8 (right).

Remark. A uniform density has been considered up to this point for sampling second-stage covariate-response pairs. We examine next the case of using an arbitrary sampling density $g_U(\cdot)$ supported on the interval $[\hat d_{n_1} - Kn_1^{-\gamma}, \hat d_{n_1} + Kn_1^{-\gamma}]$ and symmetric around $\hat d_{n_1}$. A natural choice for such a density is $g_U(u) = h((u - \hat d_{n_1})n_1^{\gamma}/K)(n_1^{\gamma}/K)$, for a density $h(\cdot)$ supported on $[-1,1]$ and symmetric about 0. Analogous arguments to those used in the proof of Theorem 2 establish that the random vector $n_2^{1+\gamma}(\hat d_{n_2,l} - d^0, \hat d_{n_2,u} - d^0)$ converges in distribution to

$(d_l(|\alpha_0 - \beta_0|, \varepsilon, C(K,\lambda,\gamma,h)),\ d_u(|\alpha_0 - \beta_0|, \varepsilon, C(K,\lambda,\gamma,h))),$

where $C(K,\lambda,\gamma,h) = (\lambda/(1-\lambda))^{\gamma}(h(0)/K)$. With the error term normally distributed, as assumed in this section, we get

$n_2^{1+\gamma}(\hat d_{2,av} - d^0) \to_d \frac{K}{h(0)}\left(\frac{1-\lambda}{\lambda}\right)^{\gamma}\operatorname{Argmin} M_{SNR,Z,1}$

and it can be readily checked that the approximate ARE formula reduces to $ARE_{2,1}(n) \approx nh(0)/(4C_{\zeta_1} f_X(d^0))$. It can be seen that the more "peaked" the sampling density $g_U$ (equivalently h), the greater the efficiency gains. However, one needs to be careful, since the above formula is obtained through asymptotic considerations. In finite samples, a very peaked density around $\hat d_{n_1}$ may not perform well, since bias issues [involving $(d^0 - \hat d_{n_1})$] must also be taken into account.



Remark. Simulation results indicate that in the presence of a small 
budget of available points (n = 20 or 50), the efficiency of the two-stage es- 
timator can be improved by employing a uniform (equispaced) design in the 
first-stage. The reason is that such a design reduces the sampling variability 
of the covariate x, which leads to improved localization of the change-point. 
However, the approximate formula for the ARE discussed above is no longer 
valid when we use a uniform design at the first-stage, since the asymptotics 
of the one-stage estimator are then no longer described by the minimizer 
of a compound Poisson process. For further discussion on uniform sampling 
designs, see Section 4.1 of Lan, Banerjee and Michailidis (2007). 

We turn our attention to the allocation of parameters $K_i$, $\lambda_i$ and $\gamma_i$, $1 \le i \le P - 1$, for the P-stage procedure. We start with a three-stage procedure and generalize afterward. From the theoretical result of Section 4.1 we have

(8)  $n_3^{1+\gamma_2}(\hat d_{n_3,av} - d^0) \to_d 2K_2\left(\frac{\lambda_3}{\lambda_2}\right)^{\gamma_2}\operatorname{Argmin} M_{SNR,Z,1}$

or

(9)  $\hat d_{n_3,av} - d^0 \approx_d \frac{2K_2}{n_3^{1+\gamma_2}}\left(\frac{\lambda_3}{\lambda_2}\right)^{\gamma_2}\operatorname{Argmin} M_{SNR,Z,1}.$
Once again, our objective is to choose $K_2$ and $\gamma_2$ so that for any fixed sample size n, with very high probability $d^0$ would be contained in the sampling interval $[\hat d_{n_2,av} - K_2 n_2^{-\gamma_2}, \hat d_{n_2,av} + K_2 n_2^{-\gamma_2}]$. Thus, using (5) we require that $K_2 n_2^{-\gamma_2} \approx (2C_{\zeta_1}C_{\zeta_2})/(n^2\lambda_2\lambda_1)$, or $K_2 = (2C_{\zeta_1}C_{\zeta_2})/(\lambda_1\lambda_2^{1-\gamma_2}n^{2-\gamma_2})$, with both $\zeta_1, \zeta_2$ being very small. With this choice for $K_2$ we obtain

$\hat d_{n_3,av} - d^0 \approx_d \frac{4C_{\zeta_1}C_{\zeta_2}}{n^3\lambda_1\lambda_2\lambda_3}\operatorname{Argmin} M_{SNR,Z,1}.$

It is then easy to see that the right-hand side is minimized when $\lambda_1 = \lambda_2 = \lambda_3 = 1/3$. The ARE formula for the three-stage procedure under the proposed (equal) allocation of samples was validated through a simulation based on 5000 replications. The same stump model was employed and the values for SNR were set equal to 5 and 8, while those for $\zeta_1 = \zeta_2 = 0.0025$. The results using a similar trimming of "outliers" for the estimates of the three-stage procedure (3 smallest and 3 largest in this case) are shown in Figure 3 and exhibit close agreement with the theoretical formula.

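Proceeding analogously to the three-stage procedure, we obtain that at stage $q - 1$,

$\hat d_{n_{q-1},av} - d^0 \approx_d \frac{2^{q-2}C_{\zeta_1}\cdots C_{\zeta_{q-2}}}{n^{q-1}\lambda_1\cdots\lambda_{q-1}}\operatorname{Argmin} M_{SNR,Z,1}.$

A level $1 - 2\zeta_{q-1}$ C.I. for $d^0$ (for a small $\zeta_{q-1}$) is then given by

(10)  $\left[\hat d_{n_{q-1},av} - \frac{2^{q-2}C_{\zeta_1}\cdots C_{\zeta_{q-2}}C_{\zeta_{q-1}}}{n^{q-1}\lambda_1\cdots\lambda_{q-1}},\ \hat d_{n_{q-1},av} + \frac{2^{q-2}C_{\zeta_1}\cdots C_{\zeta_{q-2}}C_{\zeta_{q-1}}}{n^{q-1}\lambda_1\cdots\lambda_{q-1}}\right].$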




Fig. 3. ARE for standard deviation, IQR and MAD measures for SNR= 5 (left) and 8 
(right). 
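The sampling neighborhood of $\hat d_{n_{q-1},av}$ from which we sample at stage q is formally given by

(11)  $[\hat d_{n_{q-1},av} - K_{q-1}n_{q-1}^{-\gamma_{q-1}},\ \hat d_{n_{q-1},av} + K_{q-1}n_{q-1}^{-\gamma_{q-1}}]$

and yields that

$\hat d_{n_q,av} - d^0 \approx_d \frac{2K_{q-1}}{n_{q-1}^{\gamma_{q-1}}n_q}\operatorname{Argmin} M_{SNR,Z,1}.$

Equating the neighborhoods (10) and (11) gives

$K_{q-1} = \frac{2^{q-2}C_{\zeta_1}\cdots C_{\zeta_{q-2}}C_{\zeta_{q-1}}\,n_{q-1}^{\gamma_{q-1}}}{n^{q-1}\lambda_1\cdots\lambda_{q-1}}.$

Plugging this choice of $K_{q-1}$ into the expression for the approximate distribution of $\hat d_{n_q,av} - d^0$ gives

$\hat d_{n_q,av} - d^0 \approx_d \frac{2^{q-1}C_{\zeta_1}\cdots C_{\zeta_{q-1}}}{n^{q}\lambda_1\cdots\lambda_{q}}\operatorname{Argmin} M_{SNR,Z,1}.$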



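Therefore, the P-stage estimate $(\hat d_{n_P,av} - d^0)$ is approximately distributed as

$\frac{2^{P-1}C_{\zeta_1}\cdots C_{\zeta_{P-1}}}{\lambda_1\cdots\lambda_P\,n^P}\cdot\operatorname{Argmin} M_{SNR,Z,1},$

which shows that an equal allocation of samples is warranted (i.e., $\lambda_i = 1/P$, $1 \le i \le P$). Some straightforward algebra establishes that

$ARE_{P,1}(n) \approx \frac{n^{P-1}\lambda_1\lambda_2\cdots\lambda_P}{2^{P-1}C_{\zeta_1}\cdots C_{\zeta_{P-1}}f_X(d^0)} = \frac{n^{P-1}}{2^{P-1}P^P f_X(d^0)C_{\zeta_1}\cdots C_{\zeta_{P-1}}},$

where $f_X(d^0)$ has the same connotation as in (6).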
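
The effect of the allocation on the scale factor multiplying $\operatorname{Argmin} M_{SNR,Z,1}$ can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, and the values passed for the quantiles $C_{\zeta_i}$ are placeholders, not values from the paper. It evaluates $2^{P-1}C_{\zeta_1}\cdots C_{\zeta_{P-1}}/(n^P\lambda_1\cdots\lambda_P)$ for a given allocation.

```python
import math

def scale_factor(n, lambdas, c_zetas):
    """Approximate scale of the P-stage estimate for allocation fractions `lambdas`.

    `c_zetas` holds the P-1 quantiles C_{zeta_i}; the values used below are placeholders.
    """
    P = len(lambdas)
    if len(c_zetas) != P - 1:
        raise ValueError("need one C_zeta per stage except the last")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("allocation fractions must sum to one")
    return 2 ** (P - 1) * math.prod(c_zetas) / (n ** P * math.prod(lambdas))

equal = scale_factor(300, [1 / 3, 1 / 3, 1 / 3], [1.0, 1.0])
unequal = scale_factor(300, [0.5, 0.25, 0.25], [1.0, 1.0])
```

Since the product $\lambda_1\cdots\lambda_P$ under the constraint $\sum_i\lambda_i = 1$ is largest at $\lambda_i = 1/P$, `equal` comes out smaller than `unequal` for any fixed choice of the $C_{\zeta_i}$.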



Remark. An interesting question that arises in practice is that of how many stages to use given a fixed budget of n points. Obviously the answer depends on the underlying SNR, but n/P should be fairly large. Extensive numerical work with two, three and four-stage procedures indicates that n/P in the range of 35-50 points performs well. A related issue is how to choose the values of $\zeta_i$ that determine the coverage of the successive neighborhoods. Notice that with probability $1 - (1 - 2\zeta_1)$, $d^0$ is not contained in the sampling interval around $\hat d_{n_1}$. Proceeding analogously and due to the independence of sampling at subsequent stages, it can be seen that with probability $1 - \prod_{i=1}^{P-1}(1 - 2\zeta_i)$ the shrinking sampling intervals are going to miss $d^0$ at some stage. Taking all the $2\zeta_i$'s to be equal to $\psi$, we get that this probability is given by $1 - (1 - \psi)^{P-1}$. Setting the probability of trapping $d^0$ after P stages equal to $(1 - \delta)$ ($\delta$ very small) we get that $\psi = 1 - (1 - \delta)^{1/(P-1)}$, which provides a good guideline for determining the size of the sampling neighborhoods. There remains the issue of the sampling interval not containing $d^0$ at some stage. The ARE results based on simulations for a 3-stage procedure, that are shown in Figure 3, are somewhat optimistic due to the trimming of outliers discussed previously; however, even with outliers included, improvements still occur, albeit small compared to the one-stage procedure.
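
The guideline $\psi = 1 - (1 - \delta)^{1/(P-1)}$ from the remark above is easy to evaluate. The following small Python sketch (the function name is hypothetical) returns the common per-stage miss probability $\psi = 2\zeta_i$ for a target overall trapping probability $1 - \delta$.

```python
def per_stage_miss_probability(delta, P):
    """Common value psi = 2 * zeta_i such that (1 - psi) ** (P - 1) == 1 - delta."""
    if P < 2:
        raise ValueError("a multistage procedure needs at least two stages")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return 1.0 - (1.0 - delta) ** (1.0 / (P - 1))

psi_two_stage = per_stage_miss_probability(0.05, 2)    # equals delta itself
psi_three_stage = per_stage_miss_probability(0.05, 3)  # smaller, since two intervals must both trap d^0
```

The corresponding $\zeta_i$ is `psi / 2`; more stages force a smaller $\psi$ and hence wider sampling neighborhoods at each stage.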

5. Confidence intervals for the change-point. We compare next the performance of exact confidence intervals based on the result established in Theorem 2 to those proposed in Ferger (2004). Moreover, confidence intervals for finite samples will be constructed following the discussion in Section 4.

For all these comparisons, simulations were run for a stump model with $\alpha_0 = 0.5$, $\beta_0 = 1.5$, $d^0 = 0.5$ and sample sizes n = 50, 100, 200, 500, 1000 with N = 2000 replicates for each n. Confidence intervals for $d^0$ based on the minimal minimizer $\hat d_{n_2,l}$, the maximal minimizer $\hat d_{n_2,u}$ and the average minimizer $\hat d_{n_2,av} = (\hat d_{n_2,l} + \hat d_{n_2,u})/2$ were constructed. Two values of $\gamma$ (1/2 and 2/3) and two values of K (1 and 2) were used together with the optimal allocation $\lambda = \gamma/(1+\gamma)$ as discussed in Section 4. The confidence level was set at $1 - \tau = 0.95$ and the percentage of replicates for which the true change-point was included in the corresponding intervals, as well as the average length of each interval, were recorded. In what follows, the symbols $d_l$ and $d_u$ have the same connotations as in the proof of Theorem 3.2.

5.1. Conservative intervals. Using the results of Ferger (2004), based on any two-stage estimator $\hat d_{n_2}$, we propose an asymptotically conservative confidence interval for $d^0$ at level $1 - \tau$:

$I_{n_2}(\tau) := (\hat d_{n_2} - b/n_2^{1+\gamma},\ \hat d_{n_2} - a/n_2^{1+\gamma}),$

where $a < b$ are any solutions of the inequality

$\mathrm{Prob}(d_u < b) - \mathrm{Prob}(d_l < a) \ge 1 - \tau.$

Based on the smallest, largest and average minimizers at stage two, we therefore obtain intervals

$I_{n_2,l}(\tau) = (\hat d_{n_2,l} - b/n_2^{1+\gamma},\ \hat d_{n_2,l} - a/n_2^{1+\gamma}),$

$I_{n_2,u}(\tau) = (\hat d_{n_2,u} - b/n_2^{1+\gamma},\ \hat d_{n_2,u} - a/n_2^{1+\gamma})$

and

$I_{n_2,av}(\tau) = (\hat d_{n_2,av} - b/n_2^{1+\gamma},\ \hat d_{n_2,av} - a/n_2^{1+\gamma}),$

where a is the $\tau/2$th quantile of $d_l$ and b is the $(1 - \tau/2)$th quantile of $d_u$. At this point, these quantiles do not seem to be analytically determinable but can certainly be simulated to a reasonable degree of approximation.
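
Once simulated draws of $d_l$ and $d_u$ are available, the conservative interval follows directly from the formula above. The sketch below is a hedged illustration, assuming NumPy: the function name is hypothetical, and the arrays `draws_l` and `draws_u` are assumed to come from a separate simulator of the limiting process, which is not shown and not specified here.

```python
import numpy as np

def conservative_interval(d_hat, n2, gamma, draws_l, draws_u, tau=0.05):
    """Interval (d_hat - b / n2**(1+gamma), d_hat - a / n2**(1+gamma)).

    a is the tau/2 empirical quantile of the d_l draws and
    b is the (1 - tau/2) empirical quantile of the d_u draws.
    """
    a = np.quantile(draws_l, tau / 2)
    b = np.quantile(draws_u, 1 - tau / 2)
    scale = n2 ** (1 + gamma)
    return d_hat - b / scale, d_hat - a / scale

# Placeholder draws purely to exercise the function; real draws would be simulated
# from the limit distribution of the minimal and maximal minimizers.
draws_l = np.linspace(-3.0, 1.0, 401)
draws_u = np.linspace(-1.0, 3.0, 401)
lower, upper = conservative_interval(0.5, n2=100, gamma=0.5, draws_l=draws_l, draws_u=draws_u)
```

The same function applies to the minimal, maximal or average minimizer by changing `d_hat`; only the accuracy of the simulated quantiles limits the approximation.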

In Table 1 the coverage probabilities together with the length of the confidence intervals are shown for a number of combinations of sample sizes and tuning parameters and with the SNR set equal to 5. It can be seen that the recorded coverage exceeds the nominal level of 95% and almost reaches perfect (100%) coverage for the average minimizer.

5.1.1. Exact confidence intervals. On the other hand, since Theorem 3.2 provides us with the exact asymptotic distributions of the sample minimizers, we can construct asymptotically exact (level $1 - \tau$) confidence intervals as follows:

$\tilde I_{n_2,l}(\tau) = (\hat d_{n_2,l} - b_l/n_2^{1+\gamma},\ \hat d_{n_2,l} - a_l/n_2^{1+\gamma}),$

$\tilde I_{n_2,u}(\tau) = (\hat d_{n_2,u} - b_u/n_2^{1+\gamma},\ \hat d_{n_2,u} - a_u/n_2^{1+\gamma}),$

$\tilde I_{n_2,av}(\tau) = (\hat d_{n_2,av} - b_{av}/n_2^{1+\gamma},\ \hat d_{n_2,av} - a_{av}/n_2^{1+\gamma}),$

where $a_l, b_l, a_u, b_u, a_{av}$ and $b_{av}$ are the exact quantiles [$a_l$, $a_u$ and $a_{av}$ correspond to $\tau/2$th quantiles and $b_l$, $b_u$ and $b_{av}$ correspond to $(1 - \tau/2)$th quantiles] of $d_l$, $d_u$ and $(d_l + d_u)/2$, respectively.

In Table 2 the coverage probabilities together with the length of the confidence intervals are shown for a number of combinations of sample sizes and tuning parameters and with the SNR set equal to 5. It can be seen that the coverage probabilities are fairly close to their nominal values, especially for $\gamma = 2/3$. Further, their length is almost half of those obtained according to Ferger's (2004) method. Finally, it should be noted that analogous results were obtained for SNR = 2 and 8 (not shown due to space considerations).

5.1.2. Construction of confidence intervals in finite samples. Confidence intervals in finite samples can also be based on the adaptive parameter allocation strategies discussed in Section 4 for the two-stage procedure. We briefly discuss this below, adopting notation from that section.

From (5) with equal allocation of points between the two stages we have that

$\hat d_{n_2,av} - d^0 \approx_d \frac{8C_{\zeta_1}}{n^2}\operatorname{Argmin} M_{SNR,Z,1}.$

Therefore, an approximate level $1 - \tau$ confidence interval is given by

(12)  $\left[\hat d_{n_2,av} - \frac{8C_{\zeta_1}C_{\tau/2}}{n^2},\ \hat d_{n_2,av} + \frac{8C_{\zeta_1}C_{\tau/2}}{n^2}\right].$



Simulations were run for the above stump model for four different sample 
sizes: 50, 100, 200 and 500 with 5000 replicates for each sample size and for 



Table 1

95% conservative confidence intervals for a combination of sample sizes and the tuning parameters $\gamma$, K and for SNR = 5 (coverage, with interval length in parentheses beneath)

              n = 50              n = 100             n = 200             n = 500             n = 1000
              K = 1     K = 2     K = 1     K = 2     K = 1     K = 2     K = 1     K = 2     K = 1     K = 2

$\gamma = 1/2$
I_{n2,l}      97.40%    97.55%    97.40%    98.20%    97.65%    98.00%    96.55%    97.05%    97.15%    97.75%
              (0.0580)  (0.1208)  (0.0205)  (0.0427)  (0.0072)  (0.0151)  (0.0018)  (0.0038)  (0.0006)  (0.0014)
I_{n2,u}      97.05%    98.60%    97.65%    97.05%    97.65%    97.85%    97.40%    97.90%    97.80%    98.00%
              (0.0580)  (0.1208)  (0.0205)  (0.0427)  (0.0072)  (0.0151)  (0.0018)  (0.0038)  (0.0006)  (0.0014)
I_{n2,av}     99.80%    99.95%    99.80%    99.95%    100%      100%      99.80%    100%      99.70%    99.95%
              (0.0580)  (0.1208)  (0.0205)  (0.0427)  (0.0072)  (0.0151)  (0.0018)  (0.0038)  (0.0006)  (0.0014)

$\gamma = 2/3$
I_{n2,l}      98.15%    97.70%    97.65%    98.30%    97.90%    97.70%    97.65%    97.30%    97.75%    97.70%
              (0.0299)  (0.0581)  (0.0094)  (0.0183)  (0.0030)  (0.0058)  (0.0006)  (0.0013)  (0.0002)  (0.0004)
I_{n2,u}      98.00%    98.20%    97.85%    98.55%    97.90%    98.10%    98.30%    97.75%    97.60%    98.50%
              (0.0299)  (0.0581)  (0.0094)  (0.0183)  (0.0030)  (0.0058)  (0.0006)  (0.0013)  (0.0002)  (0.0004)
I_{n2,av}     99.60%    99.95%    99.60%    99.90%    99.85%    99.95%    99.85%    99.90%    99.85%    99.95%
              (0.0299)  (0.0581)  (0.0094)  (0.0183)  (0.0030)  (0.0058)  (0.0006)  (0.0013)  (0.0002)  (0.0004)


Table 2

95% exact confidence intervals for a combination of sample sizes and tuning parameters $\gamma$, K for SNR = 5 (coverage, with interval length in parentheses beneath)

                    n = 50              n = 100             n = 200             n = 500             n = 1000
                    K = 1     K = 2     K = 1     K = 2     K = 1     K = 2     K = 1     K = 2     K = 1     K = 2

$\gamma = 1/2$
\tilde I_{n2,l}     95.00%    94.20%    93.80%    95.25%    95.25%    94.10%    93.40%    94.50%    95.05%    94.90%
                    (0.0283)  (0.0599)  (0.0100)  (0.0212)  (0.0035)  (0.0075)  (0.0009)  (0.0019)  (0.0003)  (0.0007)
\tilde I_{n2,u}     94.20%    96.50%    94.80%    94.95%    94.45%    95.85%    94.85%    95.90%    95.50%    95.85%
                    (0.0294)  (0.0602)  (0.0104)  (0.0213)  (0.0037)  (0.0075)  (0.0009)  (0.0019)  (0.0003)  (0.0007)
\tilde I_{n2,av}    94.30%    95.25%    94.35%    94.05%    94.40%    95.45%    93.20%    94.85%    94.55%    95.30%
                    (0.0236)  (0.0487)  (0.0083)  (0.0172)  (0.0029)  (0.0061)  (0.0007)  (0.0015)  (0.0003)  (0.0005)

$\gamma = 2/3$
\tilde I_{n2,l}     95.65%    95.40%    95.05%    96.05%    95.35%    95.45%    95.30%    95.30%    95.05%    95.90%
                    (0.0148)  (0.0277)  (0.0047)  (0.0087)  (0.0015)  (0.0027)  (0.0003)  (0.0006)  (0.0001)  (0.0002)
\tilde I_{n2,u}     95.40%    96.00%    95.65%    96.60%    95.80%    96.15%    96.20%    96.60%    95.15%    96.85%
                    (0.0149)  (0.0302)  (0.0047)  (0.0095)  (0.0015)  (0.0030)  (0.0003)  (0.0006)  (0.0001)  (0.0002)
\tilde I_{n2,av}    95.15%    96.45%    95.20%    96.95%    94.75%    96.10%    95.30%    96.15%    94.05%    96.10%
                    (0.0120)  (0.0253)  (0.0038)  (0.0080)  (0.0012)  (0.0025)  (0.0003)  (0.0005)  (0.0001)  (0.0002)

Table 3

95% confidence intervals constructed using the adaptive parameter allocation strategy for different sample sizes and SNR with $\zeta_1 = 0.0005$

          SNR = 2               SNR = 5               SNR = 8
N         Coverage   Length     Coverage   Length     Coverage   Length
50        93.24%     0.2780     95.48%     0.0383     95.68%     0.0329
100       94.08%     0.0695     95.24%     0.0096     95.54%     0.0082
200       94.48%     0.0174     94.78%     0.0024     95.16%     0.0021
500       94.82%     0.0028     95.08%     0.00038    94.94%     0.00033


three different values of SNR = 2, 5, 8. Confidence intervals as defined above were constructed (with $\tau = 0.05$). The percentage of intervals containing the true change-point together with their length were recorded and shown in Table 3.

We examine next the performance of confidence intervals in finite samples, but where a uniform (equispaced) design is used in the first-stage (results shown in Table 4) and in both stages (results shown in Table 5). The setting is identical to that used in Table 3. It is not clear how the tuning parameter $C_{\zeta_1}$, that determines the interval from which sampling is done at the second stage, should be chosen in this case, since the first-stage estimate may not even have an asymptotic distribution in the proper sense. Therefore, the same $C_{\zeta_1}$ value as the one used in Table 3 was employed. It can be seen that a uniform design used in the first-stage does not improve performance in terms of coverage or length. However, using a uniform design in both stages and setting $C_{\zeta_1}$ and $C_{\tau/2}$ to the same values as in Table 3 leads to rather conservative confidence intervals, especially for larger sample sizes and higher values of SNR. Notice that the lengths of the confidence intervals are identical to those in Table 3 due to the choice of the tuning parameters $C_{\zeta_1}$ and $C_{\tau/2}$. Nevertheless, experience shows that a uniform design used in

Table 4

95% confidence interval constructed using the adaptive parameter allocation strategy for different sample sizes and SNR with $\zeta_1 = 0.0005$ using a uniform design in the first-stage

          SNR = 2               SNR = 5               SNR = 8
N         Coverage   Length     Coverage   Length     Coverage   Length
50        93.72%     0.2780     95.14%     0.0383     95.56%     0.0329
100       93.88%     0.0695     95.12%     0.0096     95.20%     0.0082
200       94.62%     0.0174     95.52%     0.0024     95.52%     0.0021
500       94.72%     0.0028     94.96%     0.00038    95.12%     0.00033
Table 5

95% confidence interval constructed using the adaptive parameter allocation strategy for different sample sizes and SNR with $\zeta_1 = 0.0005$ using a uniform design in both stages

          SNR = 2               SNR = 5               SNR = 8
N         Coverage   Length     Coverage   Length     Coverage   Length
50        95.06%     0.2780     100.00%    0.0383     100.00%    0.0329
100       96.66%     0.0695     99.98%     0.0096     100.00%    0.0082
200       96.94%     0.0174     100.00%    0.0024     100.00%    0.0021
500       97.32%     0.0028     99.96%     0.00038    100.00%    0.00033



the first-stage gives better mean squared errors in small samples, or when $d^0$ is closer to the boundary of the covariate's support.

5.2. Data application. We revisit the motivating application and esti- 
mate the change-point using both the "classical" and the developed two- 
stage procedures. The total budget was set to n = 70 and the model fitted to 
the natural logarithm of the delays comprised two linear segments with a dis- 
continuity. Given that the data (134 system loadings and their corresponding 
average delays) have been collected in advance, a sampling mechanism close 
in spirit to selecting covariate values from a uniform distribution was em- 
ployed for both procedures. Specifically, the necessary number of points was 
drawn from a uniform distribution in the [0, 1] interval and amongst the 
available 134 loadings the ones closest to the sampled points were selected, 
together with their corresponding responses. An analogous strategy was used 
when a uniform design was employed in the first-stage of the adaptive procedure. For the two-stage procedure, we set $\lambda = 1/2$ and the remaining tuning parameters to those values provided by the adaptive strategy discussed in Section 4, with $\zeta_1 = 0.0005$. The results of the "classical" procedure, the
two-stage adaptive procedure with sampling from a uniform distribution in 
both stages and from a uniform design in the first-stage and the uniform dis- 
tribution in the second stage are depicted in the left, center and right panels 
of Figure 4, respectively. The depicted fitted regression models are based on 
the first-stage estimates for the two-stage procedure. Further, the sampled 
points from the two stages are shown as solid (first-stage) and open (second 
stage) circles. It can be seen that the heavier sampling in the neighborhood 
of the first-stage estimate of the change-point improves the estimate given 
the available evidence from all 134 points shown in Figure 1. 
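
The resampling mechanism used above for pre-collected data (draw target covariate values from a uniform distribution, then pick the closest available loadings) can be sketched as follows. This is an illustrative fragment assuming NumPy and covariates rescaled to [0, 1]; the function name is hypothetical, and the sketch allows the same loading to be selected more than once, a detail the text does not specify.

```python
import numpy as np

def select_nearest(pool_x, pool_y, n_points, rng, lo=0.0, hi=1.0):
    """Draw n_points targets uniformly on [lo, hi] and return the closest pool observations."""
    targets = rng.uniform(lo, hi, n_points)
    # index of the pool covariate nearest to each target (repeats are possible)
    idx = np.abs(pool_x[None, :] - targets[:, None]).argmin(axis=1)
    return pool_x[idx], pool_y[idx]

# Synthetic stand-in for the 134 (loading, log-delay) pairs, for illustration only.
pool_x = np.linspace(0.0, 1.0, 134)
pool_y = 2.0 * pool_x
x_sel, y_sel = select_nearest(pool_x, pool_y, 35, np.random.default_rng(1))
```

For the second stage, the same function can be called with `lo` and `hi` set to the endpoints of the sampling neighborhood around the first-stage estimate.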

The estimated change-point from the "classical" procedure is $\hat d_n = 0.737$ with a 95% confidence interval (0.682, 0.793). Using a uniform distribution in both stages gave an estimate $\hat d_{n_2} = 0.796$ with a 95% confidence interval (0.781, 0.811). On the other hand, a combination of a uniform design in the first-stage with that of a uniform distribution in the second stage yielded an estimate $\hat d_{n_2} = 0.802$ with a 95% confidence interval (0.787, 0.817). As shown in this case and validated through other data examples, the use of uniform design in the first-stage proves advantageous in practice, especially for small samples or in situations where the discontinuity lies fairly close to the boundary of the design region.

Fig. 4. Sampled points (from first-stage solid circles and from second-stage open circles) together with the fitted parametric models and estimated change-point, based on a total budget of n = 70 points, obtained from the "classical" procedure (left panel), the two-stage adaptive procedure with sampling from a uniform distribution in both stages (center panel) and from a uniform design in the first-stage and the uniform distribution in the second stage (right panel).

6. Concluding remarks. In this study, a multistage adaptive procedure 
for estimating a jump discontinuity in a parametric regression model was 
introduced and its properties investigated. Specifically, it was established 
that the rate of convergence of the least squares estimator of the change- 
point can be accelerated and its asymptotic distribution derived. Several 
issues pertaining to the tuning of the parameters involved in the procedure 
were examined and resolved. In practice, it is generally recommended that in 
the presence of a small budget of points, a uniform design in the first-stage be 
employed. At present the parameters of the parametric model are estimated 
in the first-stage of the experiment and held fixed in subsequent stages. One 
may wonder why the parameters are not re-estimated in the presence of 
additional samples. The main reason is that the additional samples obtained 
from a shrinking neighborhood around $d^0$ are not particularly helpful for
estimating global regression parameters. The sole exception is the stump 
model, where using all n points provides better estimates, especially for 
small budgets. 

The results have been established under the assumption of a bounded 
covariate, because this is usually the case in applications. However, the ac- 
celerated convergence can be achieved with an unbounded covariate as well, 
so long as the first-stage estimate is consistent at rate n, which will be the 
case under fairly mild conditions. 
Another issue is the case where there are multiple change-points, but with
their number known a priori. In such a case, one will estimate each of the 
change-points at stage one (at rate n) and subsequently construct appropri- 
ate neighborhoods around these estimates. Notice that asymptotically the 
neighborhoods are disjoint and therefore one can perform the second-stage 
experiment for a single change-point model independently on each of them. 
However, there is an issue of allocation of second-stage points in the neigh- 
borhoods. Some algebra involving minimization of the sum of the asymptotic 
variances of the second stage estimators shows that 50% of the budget should 
be allocated at the first-stage, while the remaining budget should be allo- 
cated proportionally to the limiting standard deviations of the first-stage 
estimates, which can be estimated from the first-stage experiment. The real 
challenge with multiple change-points is when their number is unknown and 
needs to be determined from the data. This topic remains to be studied. 
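
The allocation rule just described for a known number of change-points (half the budget at the first stage, the remainder split in proportion to the limiting standard deviations of the first-stage estimates) can be sketched as follows. The function name and the rounding scheme are assumptions made for illustration; the standard deviations would in practice be estimated from the first-stage experiment.

```python
def allocate_budget(n, first_stage_sds):
    """Return (n1, counts): first-stage size and second-stage points per neighborhood."""
    n1 = n // 2
    remaining = n - n1
    total = sum(first_stage_sds)
    raw = [remaining * s / total for s in first_stage_sds]
    counts = [int(r) for r in raw]
    # give the points lost to rounding to the largest fractional parts
    leftovers = remaining - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return n1, counts

n1, counts = allocate_budget(100, [1.0, 3.0])
```

With two change-points whose first-stage estimates have limiting standard deviations in ratio 1:3, the second neighborhood receives roughly three times as many second-stage points as the first.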

We briefly address the case of multiple covariates next. The simplest possible model is for two covariates $(x^1, x^2)$ on the unit square $[0,1]^2$ and defined as follows:

$Y_i = \alpha_1 1(x_i^1 \le d^1) + \beta_1 1(x_i^1 > d^1) + \alpha_2 1(x_i^2 \le d^2) + \beta_2 1(x_i^2 > d^2) + \varepsilon_i.$

For the sake of simplicity, suppose that a budget of 2n points is available. The change-points $d^1$ and $d^2$ can be estimated at rate n, by sampling n points from a uniform distribution on the unit square and solving the corresponding least squares problem. One can then form neighborhoods around $\hat d^1$ and $\hat d^2$, respectively, of the type employed in the univariate models. Hence, a "shrunken" rectangular neighborhood of $(\hat d^1, \hat d^2)$ is obtained, from which one can sample n points uniformly at the second stage. The least squares estimates from stage two will inherit the accelerated rate of convergence. Similar considerations apply to general parametric models.

Note that in this setting, there are changing regression models on each quadrant of $[0,1]^2$ defined by two orthogonally intersecting lines. A more complex model arises when the lines intersect at an unknown angle. However, a treatment of such models is beyond the scope of this paper and is left as a topic of future research.

Finally, it should be noted that a multistage adaptive procedure would 
also work in the context of a nonparametric model with a jump discontinuity 
and produce analogous accelerations to the convergence rate of the employed 
estimator; a comprehensive treatment of this topic is currently under study. 

APPENDIX 

Proof of Lemma 3.2. We first note that the space under consideration (which we view as a metric subspace of $D[-C,C] \times D[-C,C]$) is a measurable subset of $D[-C,C] \times D[-C,C]$. To establish convergence in distribution in that space, it therefore suffices to establish convergence in distribution in the larger space $D[-C,C] \times D[-C,C]$ [see the discussion in Example 3.1 of Billingsley (1999)]. This can be achieved by (a) establishing finite-dimensional convergence: showing that $\{\mathbb{M}_{n_2}(h_i), \mathbb{J}_{n_2}(h_i)\}_{i=1}^{l} \to_d \{\mathbb{M}(h_i), \mathbb{J}(h_i)\}_{i=1}^{l}$ for all $h_1, h_2, \ldots, h_l$ in $[-C,C]$ and (b) verifying tightness of the processes $(\mathbb{M}_{n_2}(h), \mathbb{J}_{n_2}(h))$ under the product topology. But this boils down to verifying marginal tightness.

Let $L^+(t) = E[e^{it(W-(\alpha_0+\beta_0)/2)} \mid U = d^{0+}] = \lim_{d\to d^0+} E[e^{it(W-(\alpha_0+\beta_0)/2)} \mid U = d]$ and $L^-(t) = E[e^{it(W-(\alpha_0+\beta_0)/2)} \mid U = d^0]$. It is not difficult to see that $L^+$ is the characteristic function of the $V_i^+$'s while $L^-$ is the characteristic function of the $V_i^-$'s. In order to establish finite-dimensional convergence, we first show the convergence of one-dimensional marginals: that is, for a fixed s, $\mathbb{M}_{n_2}(s)$ converges in distribution to $\mathbb{M}(s)$. We do this via characteristic functions. Consider $\phi_s(t)$, the characteristic function of $\mathbb{M}(s)$ (with $s > 0$). We have

$\phi_s(t) = \sum_{l=0}^{\infty} e^{-(s/2K)(\lambda/(1-\lambda))^{\gamma}}\,\frac{((s/2K)(\lambda/(1-\lambda))^{\gamma})^{l}}{l!}\,(L^+(t))^{l}$

$= e^{-(s/2K)(\lambda/(1-\lambda))^{\gamma}}\,e^{L^+(t)(s/2K)(\lambda/(1-\lambda))^{\gamma}} = e^{-(s/2K)(\lambda/(1-\lambda))^{\gamma}(1 - L^+(t))}.$

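We show that $Q_{n_2,s}(t) = E[e^{it\mathbb{M}_{n_2}(s)}]$ converges to $\phi_s(t)$. Let $\xi_{n_1} = n_1(\hat d_{n_1} - d^0)$, $\eta_{n_1,1} = \sqrt{n_1}(\hat\alpha_{n_1} - \alpha_0)$, $\eta_{n_1,2} = \sqrt{n_1}(\hat\beta_{n_1} - \beta_0)$. We have

$Q_{n_2,s}(t) = \int Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi)\,dZ_{n_1}(\eta_1,\eta_2,\xi),$

where $Z_{n_1}$ is the joint distribution of $(\eta_{n_1,1}, \eta_{n_1,2}, \xi_{n_1})$ and

$Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi) = E\big[e^{it\sum_{i=1}^{n_2}(W_i - (\hat\alpha_{n_1}+\hat\beta_{n_1})/2)\{1(U_i \le d^0 + s/n_2^{1+\gamma}) - 1(U_i \le d^0)\}} \,\big|\, \eta_{n_1,1} = \eta_1,\ \eta_{n_1,2} = \eta_2,\ \xi_{n_1} = \xi\big].$

Let $\varepsilon > 0$ be pre-assigned. By Proposition 1, we can find $L > 0$ such that for all sufficiently large n, $Z_{n_1}([-L,L]^3) > 1 - \varepsilon/3$. Using the fact that characteristic functions are bounded by 1, it follows immediately that for all $n \ge N_0$ (for some $N_0$),

$|Q_{n_2,s}(t) - \phi_s(t)| \le \sup_{(\eta_1,\eta_2,\xi)\in[-L,L]^3} |Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi) - \phi_s(t)| + 2\varepsilon/3.$

For this fixed L, we now show that for all sufficiently large n

$D_n \equiv \sup_{(\eta_1,\eta_2,\xi)\in[-L,L]^3} |Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi) - \phi_s(t)| < \varepsilon/3,$

whence it follows that eventually $|Q_{n_2,s}(t) - \phi_s(t)| < \varepsilon$. To show the uniform convergence of $Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi)$ to $\phi_s(t)$ over the compact rectangle $[-L,L]^3$ we proceed as follows.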

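For given L and C, it is the case that for all sufficiently large n, for any $\xi \in [-L,L]$ and any $0 < s < C$,

$d^0 + \xi/n_1 - Kn_1^{-\gamma} < d^0 < d^0 + s/n_2^{1+\gamma} < d^0 + \xi/n_1 + Kn_1^{-\gamma}.$

Let $p_{n_2,d}(s) = \Pr(d^0 < U_i \le d^0 + s/n_2^{1+\gamma} \mid \hat d_{n_1} = d)$. Consider the conditional characteristic function $Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi)$ for $(\eta_1,\eta_2,\xi) \in [-L,L]^3$. It follows from the above display that for all sufficiently large n (depending only on L and C),

$Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi) = \Big\{[1 - p_{n_2,d^0+\xi/n_1}(s)] + \frac{n_1^{\gamma}}{2K}\int_{d^0}^{d^0+s/n_2^{1+\gamma}} E[e^{it(W_i - (\hat\alpha_{n_1}+\hat\beta_{n_1})/2)} \mid U_i = u]\,du\Big\}^{n_2}$

(where $\hat\alpha_{n_1} = \alpha_0 + \eta_1/\sqrt{n_1}$ and $\hat\beta_{n_1} = \beta_0 + \eta_2/\sqrt{n_1}$)

$= \Big\{1 - \frac{1}{n_2}\frac{s}{2K}\Big(\frac{\lambda}{1-\lambda}\Big)^{\gamma} + \frac{1}{2Kn_2}\Big(\frac{\lambda}{1-\lambda}\Big)^{\gamma} e^{-it(\eta_1+\eta_2)/(2\sqrt{n_1})}\int_0^{s} E[e^{it(W_i - (\alpha_0+\beta_0)/2)} \mid U_i = d^0 + v/n_2^{1+\gamma}]\,dv\Big\}^{n_2}$

$= \Big\{1 - \frac{1}{n_2}\frac{s}{2K}\Big(\frac{\lambda}{1-\lambda}\Big)^{\gamma}\big(1 - B_{n_1,n_2,\eta_1,\eta_2}(s)\big)\Big\}^{n_2},$

where

$B_{n_1,n_2,\eta_1,\eta_2}(s) = e^{-it(\eta_1+\eta_2)/(2\sqrt{n_1})}\,\frac{1}{s}\int_0^{s} E[e^{it(W_i - (\alpha_0+\beta_0)/2)} \mid U_i = d^0 + v/n_2^{1+\gamma}]\,dv$

and

$D_n = \sup_{(\eta_1,\eta_2)\in[-L,L]^2}\Big|\Big\{1 - \frac{1}{n_2}\frac{s}{2K}\Big(\frac{\lambda}{1-\lambda}\Big)^{\gamma}\big(1 - B_{n_1,n_2,\eta_1,\eta_2}(s)\big)\Big\}^{n_2} - \phi_s(t)\Big|.$

Let $z_n(\eta_1,\eta_2) = -\frac{s}{2K}\big(\frac{\lambda}{1-\lambda}\big)^{\gamma}(1 - B_{n_1,n_2,\eta_1,\eta_2}(s))$. It is easy to see that

$\tilde D_n = \sup_{(\eta_1,\eta_2)\in[-L,L]^2} |z_n(\eta_1,\eta_2) - z_0| \to 0,$

where $z_0 = -(s/2K)(\lambda/(1-\lambda))^{\gamma}(1 - L^+(t))$. Consider now

$D_n = \sup_{(\eta_1,\eta_2)\in[-L,L]^2}\Big|\Big(1 + \frac{1}{n_2}z_n(\eta_1,\eta_2)\Big)^{n_2} - e^{z_0}\Big|.$

This is dominated by $I_n + II_n$ where

$I_n = \Big|\Big(1 + \frac{1}{n_2}z_0\Big)^{n_2} - e^{z_0}\Big|$

and

$II_n = \sup_{(\eta_1,\eta_2)\in[-L,L]^2}\Big|\Big(1 + \frac{1}{n_2}z_n(\eta_1,\eta_2)\Big)^{n_2} - \Big(1 + \frac{1}{n_2}z_0\Big)^{n_2}\Big|.$

Clearly $I_n \to 0$. Since $\tilde D_n$ goes to 0, for all sufficiently large n, $|z_0| \vee (\sup_{(\eta_1,\eta_2)\in[-L,L]^2}|z_n(\eta_1,\eta_2)|)$ is bounded by a constant, say M. Straightforward algebra shows that for all sufficiently large n,

$II_n \le \Big(\sup_{(\eta_1,\eta_2)\in[-L,L]^2}|z_n(\eta_1,\eta_2) - z_0|\Big)\frac{1}{n_2}\sum_{j=1}^{n_2}\Big(1 + \frac{M}{n_2}\Big)^{j-1}\Big(1 + \frac{M}{n_2}\Big)^{n_2-j}$

$= \sup_{(\eta_1,\eta_2)\in[-L,L]^2}|z_n(\eta_1,\eta_2) - z_0|\,\Big(1 + \frac{M}{n_2}\Big)^{n_2-1} \to 0.$

Thus, $D_n \to 0$ and the uniform convergence of $Q^*_{n_2,s}(t,\eta_1,\eta_2,\xi)$ to $\phi_s(t) = e^{z_0}$ on $[-L,L]^3$ is established.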

The next step is to establish the weak convergence of the finite-dimensional distributions of $(\mathbb{M}_{n_2}, \mathbb{J}_{n_2})$ to those of $(\mathbb{M}, \mathbb{J})$. For convenience, we restrict ourselves only to the set $[0, C]$. Let J be a positive integer and consider $0 = s_0 < s_1 < s_2 < \cdots < s_J < C$. Let $c_1, c_2, \ldots, c_J$ and $d_1, d_2, \ldots, d_J$ be constants. We want to show that the characteristic function $A_n$ of the corresponding linear combination of the increments of $(\mathbb{M}_{n_2}, \mathbb{J}_{n_2})$ converges to its counterpart A for $(\mathbb{M}, \mathbb{J})$, for any vector of constants $(c_1, c_2, \ldots, c_J, d_1, d_2, \ldots, d_J)$. By the Cramér-Wold device it follows that

$(\{\mathbb{M}_{n_2}(s_i) - \mathbb{M}_{n_2}(s_{i-1})\}_{i=1}^{J}, \{\mathbb{J}_{n_2}(s_i) - \mathbb{J}_{n_2}(s_{i-1})\}_{i=1}^{J}) \to_d (\{\mathbb{M}(s_i) - \mathbb{M}(s_{i-1})\}_{i=1}^{J}, \{\mathbb{J}(s_i) - \mathbb{J}(s_{i-1})\}_{i=1}^{J}),$

establishing the claim. As before, $A_n = \int K^*_{n_2}(t,\eta_1,\eta_2,\xi)\,dZ_{n_1}(\eta_1,\eta_2,\xi)$, where $K^*_{n_2}$ is the corresponding conditional characteristic function given the first-stage quantities.

Proceeding as before, the convergence of $A_n$ to A follows if we establish the uniform convergence of $K^*_{n_2}(t,\eta_1,\eta_2,\xi)$ to A on a compact rectangle of the form $[-L,L]^3$. This is achieved by using arguments similar to those involved in the proof of the convergence of the one-dimensional marginals above. The details of the algebra are available in the Appendix of Lan, Banerjee and Michailidis (2007). The derivation there can be extended readily to allow for $s_j$'s that can also be negative, but that has been avoided as the extension involves no new ideas and becomes more cumbersome.

We finally show that the process M„2 restricted to [—C,C] is tight. We 
know that , /^m , c^m ) — >p (ao,/3o,d°) and rei(d„i - d°) = Op(l). Let 



^^n. = ||ani - aol < A, \(3ni - M < A, 



dn, ^j<d <d + < dn, + 
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Clearly, — > 1. The event Qn can be written as {am, Pnndm) G Rn, 
where $H_n(R_n) \to 1$, $H_n$ being the joint distribution of $(\hat\alpha_{n_1}, \hat\beta_{n_1}, \hat d_{n_1})$. Note that $\mathbb{M}_{n_2} 1(\Omega_n)$ is also a right-continuous process with left limits. We verify tightness of $\mathbb{M}_{n_2} 1(\Omega_n)$ restricted to $[-C, C]$. To this end, we verify (the analogue of) condition (13.14) on page 143 of Billingsley (1999), with $\beta = 1/2$ and $\alpha = 1$. Once again, let $0 < s_1 < s < s_2 < C$. Then



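$$E\big[\,|\mathbb{M}^{+}_{n_2}(s) - \mathbb{M}^{+}_{n_2}(s_1)| \cdot |\mathbb{M}^{+}_{n_2}(s_2) - \mathbb{M}^{+}_{n_2}(s)|\, 1(\Omega_n)\big]$$
$$= \int_{R_n} E\big[\,|\mathbb{M}^{+}_{n_2}(s) - \mathbb{M}^{+}_{n_2}(s_1)| \cdot |\mathbb{M}^{+}_{n_2}(s_2) - \mathbb{M}^{+}_{n_2}(s)| \;\big|\; (\hat\alpha_{n_1}, \hat\beta_{n_1}, \hat d_{n_1}) = (\alpha, \beta, d)\big]\, dH_n(\alpha, \beta, d),$$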

where 

E[\M+^is) - M+ (.i)| • |M+ (.2) - MtMliamJmJn,) = (a, /3,d)] 



E, 



a,l3,d 



i=l 



a + (3 



1+7 



n 



1+7 



n 



n2 



i=l 



a + f) 



i[u.<d^ + :^ 



llUi<d^ + ^ 



+7 



E, 



Y.l[d' + ^<U.<d' + - 



Tin 



rin 



x/(dO + _f_<c/,<dO+ 



1+7 - ^3 



1+7 



n 



X Wi 



a + P 



a + f5 



< E I ^ ( ^° + ifi? ^ < + - 



+7 



1+7 



X Wi. 



a + ,3 



1+7 

"2 

a + /3 
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a + 13 



sup 

-(1 + 7) . 



E. 



-(1+7)1 



X sup Ea,l3,, 



W, 



2 

a + /3 



U,=t 



nj s — si 
2K nl+^ 



n{ S2- s 



2K nl+^ 



s — Si 



2/C?i2 V 1 - A 



H2 



S2 - s / A V 



2Kn2 V 1 - A 



for any {a, f3, d) G R„ 



H1H2J2 



{S-SI){S2-S) / A 



1-A 



<C*{S2-Slf. 



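It follows that

$$E\big[\,|\mathbb{M}^{+}_{n_2}(s) - \mathbb{M}^{+}_{n_2}(s_1)| \cdot |\mathbb{M}^{+}_{n_2}(s_2) - \mathbb{M}^{+}_{n_2}(s)|\, 1(\Omega_n)\big] \le H_n(R_n)\, C^{*}(s_2 - s_1)^2 \le C^{*}(s_2 - s_1)^2,$$

which establishes tightness of $\mathbb{M}_{n_2} 1(\Omega_n)$.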
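Then, given any $\epsilon > 0$, for all $n > N_1$,

$$\mathrm{Prob}[\omega : \mathbb{M}_{n_2} 1(\Omega_n)(\omega) \in \mathcal{K}] > 1 - \epsilon,$$

where $\mathcal{K}$ is a compact set. But $\mathrm{Prob}[\omega : \omega \in \Omega_n] > 1 - \epsilon$ eventually. Therefore, eventually

$$\mathrm{Prob}[\omega \in \Omega_n \text{ and } \mathbb{M}_{n_2} 1(\Omega_n) \in \mathcal{K}] > 1 - 2\epsilon$$

and consequently $\mathrm{Prob}[\mathbb{M}_{n_2} \in \mathcal{K}] > 1 - 2\epsilon$. This establishes the tightness of $\mathbb{M}_{n_2}$ in the space of right-continuous functions with left limits on $[-C, C]$. Similarly, the tightness of $\mathbb{J}_{n_2}$ can be established. This completes the verification of marginal tightness and therefore joint tightness.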

Before embarking on the proof of Theorem 1, we need some auxiliary lemmas. We first state these below.

Lemma A.1. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random elements assuming values in a space $\mathcal{X}$. Let $\mathcal{F}$ be a class of functions with domain $\mathcal{X}$ and range in $[0, 1]$ with finite VC dimension $V(\mathcal{F})$ and set $V = 2(V(\mathcal{F}) - 1)$. Denoting by $\mathbb{P}_n$ the empirical measure corresponding to the sample and by $P$ the distribution of $X_1$, we have

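$$\Pr{}^{*}\big(\|\sqrt{n}(\mathbb{P}_n - P)\|_{\mathcal{F}} \ge \lambda\big) \le \left(\frac{D\lambda}{\sqrt{V}}\right)^{V} \exp(-2\lambda^2),$$

where $D$ is a constant that does not depend on $n$ or $\lambda$.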



This lemma is adapted from Talagrand (1994).

Lemma A.2. Let $U_1, U_2, \ldots, U_n$ be i.i.d. random variables following the uniform distribution on $(0, 1)$. Let $\epsilon_1, \epsilon_2, \ldots$ be i.i.d. mean 0 random variables with finite variance $\sigma^2$ that are independent of the $U_i$'s. Let $\beta_n(s) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(U_i \le s)$. Then for any $0 < \alpha < \beta < 1$, we have

$$P\left(\sup_{\alpha \le s \le \beta} \frac{|\beta_n(s)|}{s} > \lambda\right) \le (\alpha^{-1} - \beta^{-1})\, \lambda^{-2} \sigma^2.$$

The proof of this lemma follows the proof of Theorem 1. □
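As an informal numerical illustration of Lemma A.2 (not part of the argument), the sketch below estimates the tail probability by simulation and prints it next to the stated bound. The function names `sup_ratio` and `tail_frequency`, the choice of Gaussian errors, and all parameter values are assumptions made only for this example; a finite simulation can illustrate the inequality for one configuration but cannot establish it.

```python
import numpy as np

def sup_ratio(u, eps, a, b):
    """Return sup over s in [a, b] of |beta_n(s)| / s, where
    beta_n(s) = n^{-1/2} * sum_i eps_i * 1(u_i <= s)."""
    n = len(u)
    order = np.argsort(u)
    u_sorted = u[order]
    # csum[k] = n^{-1/2} * (sum of the eps attached to the k smallest u's)
    csum = np.concatenate(([0.0], np.cumsum(eps[order]))) / np.sqrt(n)
    # beta_n is piecewise constant and jumps only at the u_i; on each flat
    # piece |beta_n(s)| / s is largest at the left endpoint, so it suffices
    # to evaluate at s = a and at every u_i in (a, b].
    pts = np.concatenate(([a], u_sorted[(u_sorted > a) & (u_sorted <= b)]))
    counts = np.searchsorted(u_sorted, pts, side="right")  # #{i : u_i <= s}
    return np.max(np.abs(csum[counts]) / pts)

def tail_frequency(n, a, b, lam, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo frequency of the event {sup_ratio > lam}."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        u = rng.uniform(0.0, 1.0, size=n)
        eps = rng.normal(0.0, sigma, size=n)  # assumed error law
        hits += int(sup_ratio(u, eps, a, b) > lam)
    return hits / reps

# Illustrative (assumed) values of alpha, beta, lambda and sigma.
a, b, lam, sigma = 0.1, 0.9, 6.0, 1.0
bound = (1.0 / a - 1.0 / b) * sigma**2 / lam**2
freq = tail_frequency(200, a, b, lam, sigma)
print("simulated frequency:", freq, "bound from Lemma A.2:", bound)
```

Evaluating the supremum only at `a` and at the jump points is exact here because $1/s$ is decreasing on each flat stretch of $\beta_n$.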

Proof of Theorem 1. For simplicity and ease of exposition, in what follows we assume that $n$ points are used at the first stage to compute estimates $\hat\alpha_n, \hat\beta_n, \hat d_{n_1}$ of the three parameters of interest. At the second stage $n$ i.i.d. $U_i$'s are sampled from the uniform distribution on $D_n = [\hat d_{n_1} - Kn^{-\gamma}, \hat d_{n_1} + Kn^{-\gamma}]$ and the updated estimate of $d^0$ is computed as

$$\hat d_{n_2} = \arg\min_{u \in D_n} \frac{1}{n} \sum_{i=1}^{n} \big[W_i - \hat\alpha_n 1(U_i \le u) - \hat\beta_n 1(U_i > u)\big]^2 \equiv \arg\min_{u \in D_n} SS(u).$$

In the above display $W_i = f(U_i) + \epsilon_i$ where the $\epsilon_i$'s are i.i.d. error variables. Working under this more restrictive setting (of equal allocation of points at each stage) does not compromise the complexity of the arguments involved. Finally, recall that by our assumption, $E[e^{C|\epsilon_1|}]$ is finite, for some $C > 0$.

Before proceeding further, a word about the definition of argmin in the above display. The function $SS$ is a right-continuous function endowed with left limits. For this derivation, we take the argmin to be the smallest $u$ in $D_n$ for which $\min(SS(u), SS(u-)) = \inf_{x \in D_n} SS(x)$.
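The argmin convention just described can be made concrete with a short sketch. It assumes the second-stage covariates are distinct and all lie in $D_n$; the function name `second_stage_estimate`, the argument `left_end` (the left endpoint of $D_n$), and the simulated values below are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def second_stage_estimate(u, w, alpha_hat, beta_hat, left_end):
    """Smallest c in D_n at which min(SS(c), SS(c-)) equals inf SS, where
    SS(c) = mean((w - alpha_hat*1(u <= c) - beta_hat*1(u > c))**2)."""
    order = np.argsort(u)
    u, w = u[order], w[order]
    left_cost = (w - alpha_hat) ** 2    # contribution of a point with u_i <= c
    right_cost = (w - beta_hat) ** 2    # contribution of a point with u_i > c
    # levels[k] = value of SS when exactly the k smallest u's satisfy u_i <= c
    prefix = np.concatenate(([0.0], np.cumsum(left_cost)))
    suffix = np.concatenate((np.cumsum(right_cost[::-1])[::-1], [0.0]))
    levels = (prefix + suffix) / len(u)
    k = int(np.argmin(levels))          # np.argmin returns the leftmost minimum
    # SS is right-continuous and piecewise constant: it equals levels[k] on
    # [u_(k), u_(k+1)) and levels[0] on [left_end, u_(1)). The convention in
    # the text therefore picks the left end of the leftmost minimal stretch.
    return left_end if k == 0 else u[k - 1]

# Illustrative simulation; every numerical value here is an assumption.
rng = np.random.default_rng(1)
alpha0, beta0, d0 = 0.0, 1.0, 0.5       # true levels and change-point
n, K, gamma = 2000, 1.0, 0.5            # sample size and tuning constants
d1 = d0 + 0.3 / n                       # stand-in for a first-stage estimate
lo, hi = d1 - K * n ** (-gamma), d1 + K * n ** (-gamma)
u = rng.uniform(lo, hi, size=n)
w = np.where(u <= d0, alpha0, beta0) + rng.normal(0.0, 0.2, size=n)
d2 = second_stage_estimate(u, w, 0.02, 0.97, lo)
print("second-stage error:", abs(d2 - d0))
```

Because $SS$ changes value only at the sampled covariates, scanning the $n + 1$ candidate levels is enough; no grid over $D_n$ is needed.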

Denote by $G_n$ the distribution of $(\hat\alpha_n, \hat\beta_n, \hat d_{n_1})$. Now, given $\epsilon > 0$, find $L$ so large that for all sufficiently large $n$, say $n > N_0$,

$$(\hat\alpha_n, \hat\beta_n, \hat d_{n_1}) \in [\alpha_0 - L/\sqrt{n}, \alpha_0 + L/\sqrt{n}] \times [\beta_0 - L/\sqrt{n}, \beta_0 + L/\sqrt{n}] \times [d^0 - L/n, d^0 + L/n]$$

with probability greater than $1 - \epsilon$. Denote the region on the right-hand side of the above display by $R_n$. Then, for all $n > N_0$,

$$\Pr(n^{1+\gamma}|\hat d_{n_2} - d^0| > a) \le \int_{R_n} \Pr(n^{1+\gamma}|\hat d_{n_2} - d^0| > a \mid \hat\alpha_n = \alpha, \hat\beta_n = \beta, \hat d_{n_1} = t)\, dG_n(\alpha, \beta, t) + \epsilon,$$

which is dominated by $\sup_{(\alpha,\beta,t) \in R_n} \Pr_{\alpha,\beta,t}(n^{1+\gamma}|\hat d_{n_2} - d^0| > a) + \epsilon$. By making $a$ large, we will show that for all sufficiently large $n$ (say $n > N_1 > N_0$), the supremum is bounded by $\epsilon$. This will complete the proof.

First note that since $N_0$ is chosen to be sufficiently large, whenever $n > N_0$ and $t \in [d^0 - L/n, d^0 + L/n]$, it is the case that $t - Kn^{-\gamma} < d^0 - Kn^{-\gamma}/2 < d^0 + Kn^{-\gamma}/2 < t + Kn^{-\gamma}$. Let $P_n$ be the distribution of the pair $(W_1, U_1)$ generated at stage two under first-stage parameters $(\alpha, \beta, t)$ and let $\mathbb{P}_n$ be the empirical measure corresponding to $n$ i.i.d. observations from $P_n$. It is not difficult to see that $\hat d_{n_2} = \arg\min_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} \mathbb{M}_n(d)$ where $\mathbb{M}_n(d) = \mathbb{P}_n[(w - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0))]$, while $d^0 = \arg\min_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} M_n(d)$ where $M_n(d) = P_n[(w - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0))]$. We have

$$M_n(d) = \big\{|\beta_0 - (\alpha + \beta)/2|\, |d - d^0|\, 1(d > d^0) + |\alpha_0 - (\alpha + \beta)/2|\, |d - d^0|\, 1(d \le d^0)\big\}\, (n^{\gamma}/2K).$$

Now, for $0 < r < K/2$, set $a(r) = \min\{M_n(d) : |d - d^0| \ge rn^{-\gamma}\}$. Then $a(r) = \min(|\beta_0 - (\alpha + \beta)/2|, |\alpha_0 - (\alpha + \beta)/2|)\, r/2K$ and let $b(r) = (a(r) - M_n(d^0))/3 = a(r)/3$. Now, for all $n > N_0$, for $\alpha, \beta$ in the region under consideration, $b(r)$ is readily seen to be uniformly bounded below by $\kappa r$ for some constant $\kappa$ depending only on $\alpha_0, \beta_0, K, N_0$. We then have:

(13)  $\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} |\mathbb{M}_n(d) - M_n(d)| \le b(r) \;\Longrightarrow\; |\hat d_{n_2} - d^0| \le rn^{-\gamma}.$

To prove this, assume that the inequality on the left-hand side of the above display holds and consider $d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]$ with $|d - d^0| > rn^{-\gamma}$. Then, $\mathbb{M}_n(d) \ge M_n(d) - b(r) \ge a(r) - b(r)$ and $\mathbb{M}_n(d^0) \le M_n(d^0) + b(r)$ jointly imply that $\mathbb{M}_n(d) - \mathbb{M}_n(d^0) \ge a(r) - b(r) - M_n(d^0) - b(r) = b(r) > 0$. Hence, $\mathbb{M}_n(d) > \mathbb{M}_n(d^0)$.

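Now, since $\hat d_{n_2}$ is the smallest $d \in D_n$ for which $\mathbb{M}_n(d) \wedge \mathbb{M}_n(d-) = \inf_{x \in D_n} \mathbb{M}_n(x)$ and $\mathbb{M}_n$ is a (right-continuous left limits endowed) piecewise constant function with finitely many flat stretches, it is easy to see that $\mathbb{M}_n(\hat d_{n_2}) = \inf_{x \in D_n} \mathbb{M}_n(x)$. Therefore, $\mathbb{M}_n(\hat d_{n_2}) \le \mathbb{M}_n(d^0)$, showing that $|\hat d_{n_2} - d^0| \le rn^{-\gamma}$ in view of the last display above.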



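Now, consider $\Pr_{\alpha,\beta,t}(|\hat d_{n_2} - d^0| > rn^{-\gamma})$, which satisfies

(14)  $\Pr_{\alpha,\beta,t}(|\hat d_{n_2} - d^0| > rn^{-\gamma}) \le \Pr_{\alpha,\beta,t}(rn^{-\gamma} < |\hat d_{n_2} - d^0| \le \delta n^{-\gamma}) + \Pr_{\alpha,\beta,t}(|\hat d_{n_2} - d^0| > \delta n^{-\gamma})$

(15)  $\equiv P_n(\alpha, \beta, t) + Q_n(\alpha, \beta, t),$

where $\delta$ (sufficiently small, say less than $K/3$) does not depend on $t, \alpha, \beta$. We deal with $Q_n(\alpha, \beta, t)$ later. We first consider $P_n(\alpha, \beta, t) = \Pr_{\alpha,\beta,t}(rn^{-\gamma} < |\hat d_{n_2} - d^0| \le \delta n^{-\gamma})$. Since

$$\{rn^{-\gamma} < |\hat d_{n_2} - d^0| \le \delta n^{-\gamma}\} \subseteq \bigcup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \{\mathbb{M}_n(d) \le \mathbb{M}_n(d^0)\} \;\cup \bigcup_{d^0 - \delta n^{-\gamma} \le d < d^0 - rn^{-\gamma}} \{\mathbb{M}_n(d) \le \mathbb{M}_n(d^0)\},$$

we conclude that

$$P_n(\alpha, \beta, t) \le P_{n,1}(\alpha, \beta, t) + P_{n,2}(\alpha, \beta, t)$$
$$\equiv \Pr_{\alpha,\beta,t}\Big(\bigcup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \{\mathbb{M}_n(d^0) - \mathbb{M}_n(d) \ge 0\}\Big) + \Pr_{\alpha,\beta,t}\Big(\bigcup_{d^0 - \delta n^{-\gamma} \le d < d^0 - rn^{-\gamma}} \{\mathbb{M}_n(d^0) - \mathbb{M}_n(d) \ge 0\}\Big).$$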

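We first construct an upper bound on $\sup_{(\alpha,\beta,t) \in R_n} P_{n,1}(\alpha, \beta, t)$. For any $d \in (d^0 + rn^{-\gamma}, d^0 + \delta n^{-\gamma}]$ we have

$$\mathbb{M}_n(d^0) - \mathbb{M}_n(d) = (\mathbb{M}_n(d^0) - M_n(d^0)) - (\mathbb{M}_n(d) - M_n(d)) - (M_n(d) - M_n(d^0))$$
$$= -(\mathbb{M}_n(d) - M_n(d)) - \left|\beta_0 - \frac{\alpha + \beta}{2}\right| \frac{n^{\gamma}}{2K}\, |d - d^0|,$$

where the second equality uses $\mathbb{M}_n(d^0) = M_n(d^0) = 0$. Hence $0 \le \mathbb{M}_n(d^0) - \mathbb{M}_n(d)$ implies

$$\frac{|\mathbb{M}_n(d) - M_n(d)|}{n^{\gamma} |d - d^0|} \ge (2K)^{-1} \left|\beta_0 - \frac{\alpha + \beta}{2}\right|.$$

Now, for all $(\alpha, \beta, t) \in R_n$ (with $n > N_0$), $|\beta_0 - (\alpha + \beta)/2|$ is bounded below by some constant $\tilde B$, whence it follows that $0 \le \mathbb{M}_n(d^0) - \mathbb{M}_n(d)$ implies

$$\frac{|\mathbb{M}_n(d) - M_n(d)|}{n^{\gamma} |d - d^0|} \ge \frac{\tilde B}{2K}.$$

Thus,

$$\bigcup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \{\mathbb{M}_n(d^0) - \mathbb{M}_n(d) \ge 0\} \subset \Big\{\sup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \frac{|\mathbb{M}_n(d) - M_n(d)|}{n^{\gamma} |d - d^0|} \ge B\Big\},$$

where $B = \tilde B/2K$. We thus have

$$P_{n,1}(\alpha, \beta, t) \le \Pr_{\alpha,\beta,t}\Big(\sup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \frac{|\mathbb{M}_n(d) - M_n(d)|}{n^{\gamma} |d - d^0|} \ge B\Big)$$
$$= \Pr_{\alpha,\beta,t}\Big(\sup_{d^0 + rn^{-\gamma} < d \le d^0 + \delta n^{-\gamma}} \frac{|\sqrt{n}(\mathbb{P}_n - P_n) f_{d,\alpha,\beta}(u, w)|}{|d - d^0|\, n^{\gamma}} \ge \sqrt{n} B\Big),$$

where

$$f_{d,\alpha,\beta}(u, w) = (w - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0)).$$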

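Using the fact that for $d > d^0$, $(W_j - (\alpha + \beta)/2)(1(U_j \le d) - 1(U_j \le d^0)) = (\beta_0 - (\alpha + \beta)/2)(1(U_j \le d) - 1(U_j \le d^0)) + \epsilon_j (1(U_j \le d) - 1(U_j \le d^0))$, this upper bound on $P_{n,1}(\alpha, \beta, t)$ is easily seen to be dominated by $I_n + II_n$ where

$$I_n = \Pr_{\alpha,\beta,t}\Big(|\beta_0 - (\alpha + \beta)/2| \times \sup_{r < s \le \delta} \frac{|\sqrt{n}(\mathbb{P}_n - P_n)(1(u \le d^0 + sn^{-\gamma}) - 1(u \le d^0))|}{s} \ge \frac{\sqrt{n} B}{2}\Big),$$

which in turn is dominated by

$$I_n \le \Pr_{\alpha,\beta,t}\Big(\sup_{r < s \le \delta} \frac{|\sqrt{n}(\mathbb{P}_n - P_n)(1(u \le d^0 + sn^{-\gamma}) - 1(u \le d^0))|}{s} \ge \frac{\sqrt{n} B}{2B'}\Big),$$

where, for $n > N_0$ and $(\alpha, \beta, t) \in R_n$, $B'$ is an upper bound on $|\beta_0 - (\alpha + \beta)/2|$, while

$$II_n = \Pr_{\alpha,\beta,t}\Big(\sup_{r < s \le \delta} \frac{\big|n^{-1/2} \sum_{i=1}^{n} \epsilon_i (1(U_i \le d^0 + sn^{-\gamma}) - 1(U_i \le d^0))\big|}{s} \ge \frac{\sqrt{n} B}{2}\Big).$$

Since the $U_i$'s are i.i.d. Uniform on $[t - Kn^{-\gamma}, t + Kn^{-\gamma}]$, it is easy to see that the bound on $I_n$ is simply

$$P\Big(\sup_{r \le s \le \delta} \frac{|\sqrt{n}(\mathbb{Q}_n - Q)(1(w \le s))|}{s} \ge \frac{\sqrt{n} B}{2B'}\Big),$$

where $W_1, W_2, \ldots, W_n$ are i.i.d. Unif$[0, 2K]$, $\mathbb{Q}_n$ is the empirical measure of the $W_i$'s and $Q$ is the distribution of $W_1$. In terms of $U_1, U_2, \ldots, U_n$, which are i.i.d. Unif$[0, 1]$, this expression is simply:

$$P\Big(\sup_{r/2K \le s \le \delta/2K} \frac{|\sqrt{n}(\mathbb{P}_n - P)(1(u \le s))|}{s} \ge \frac{\sqrt{n}\, 2KB}{2B'}\Big).$$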

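By Lemma A.3 of Ferger (2005), this is bounded above by a constant (that depends only on $\alpha_0, \beta_0, K, N_0$) times $1/(rn)$. Now, in terms of $U_1, U_2, \ldots, U_n$ and $\tilde\epsilon_1, \tilde\epsilon_2, \ldots, \tilde\epsilon_n$ (where the $\tilde\epsilon_i$'s are defined on the same probability space as the $U_i$'s, but independently of them, and are distributed like the $\epsilon_i$'s) $II_n$ is simply

$$P\Big(\sup_{r/2K \le s \le \delta/2K} \frac{\big|n^{-1/2} \sum_{i=1}^{n} \tilde\epsilon_i 1(U_i \le s)\big|}{s} \ge \sqrt{n} KB\Big)$$

and this, by Lemma A.2, is dominated up to a constant (that only depends on $\alpha_0, \beta_0, \sigma, K, N_0$) by $1/(rn)$. It follows that for some constant $C_0$, for all $n \ge N_0$, $\sup_{(\alpha,\beta,t) \in R_n} P_{n,1}(\alpha, \beta, t) \le C_0/(rn)$. A similar (uniform) bound works for $P_{n,2}(\alpha, \beta, t)$. It follows that $\sup_{(\alpha,\beta,t) \in R_n} P_n(\alpha, \beta, t) \le C_0/(rn)$, at the expense of a larger constant $C_0$. Thus, from (15), we have

$$\sup_{(\alpha,\beta,t) \in R_n} \Pr_{\alpha,\beta,t}(|\hat d_{n_2} - d^0| > rn^{-\gamma}) \le C_0 (rn)^{-1} + \sup_{(\alpha,\beta,t) \in R_n} Q_n(\alpha, \beta, t).$$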

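To find a uniform upper bound on $Q_n(\alpha, \beta, t)$ note that, from (13), we have, for all $n > N_0$,

$$Q_n(\alpha, \beta, t) \le \Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} |\mathbb{M}_n(d) - M_n(d)| > b(\delta)\Big) \le \Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} |\mathbb{M}_n(d) - M_n(d)| > \kappa\delta\Big)$$

and it suffices to find a uniform upper bound for this last expression. But this is bounded by

$$\Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} |\sqrt{n}(\mathbb{P}_n - P_n)(f(u) - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0))| > \sqrt{n}\kappa\delta/2\Big)$$
$$+ \Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} \Big|\frac{1}{n} \sum_{i=1}^{n} \epsilon_i (1(U_i \le d) - 1(U_i \le d^0))\Big| > \kappa\delta/2\Big).$$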



To tackle the first term, we invoke Lemma A.1. For $(\alpha, \beta, t) \in R_n$, the class $\{(f(u) - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0)) : d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]\}$ is a bounded VC class of functions (with the bound not depending on $\alpha, \beta, t$) and with finite VC dimension, say $V$ (which does not depend on $\alpha, \beta, t$). Hence, we can apply Lemma A.1 to conclude that

$$\Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} |\sqrt{n}(\mathbb{P}_n - P_n)(f(u) - (\alpha + \beta)/2)(1(u \le d) - 1(u \le d^0))| > \sqrt{n}\kappa\delta/2\Big) \le C_1 \times (\sqrt{n}\kappa\delta)^{2(V-1)} \exp(-C_2 n \kappa^2 \delta^2),$$

where the constants $C_1$ and $C_2$ depend solely on the VC dimension and the upper bound on the functions. For all sufficiently large $n$, the right-hand side of the above display is less than $\epsilon/3$. To deal with the second term, we use the results on pages 132, 133 of Van de Geer (2000). We write the second term as:



$$\int \Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} \Big|\frac{1}{n} \sum_{i=1}^{n} \epsilon_i (1(u_i \le d) - 1(u_i \le d^0))\Big| > \kappa\delta/2\Big)\, dH_n(u_1, u_2, \ldots, u_n),$$

where $H_n$ is the joint distribution of $(U_1, U_2, \ldots, U_n)$. For each fixed $(u_1, u_2, \ldots, u_n)$, Corollary 8.8 of Van de Geer (2000) can be used to show that for $\delta$ sufficiently small and $n$ sufficiently large (where the thresholds do not depend on the $u_i$'s or $\alpha, \beta, t$),

$$\Pr_{\alpha,\beta,t}\Big(\sup_{d \in [t - Kn^{-\gamma}, t + Kn^{-\gamma}]} \Big|\frac{1}{n} \sum_{i=1}^{n} \epsilon_i (1(u_i \le d) - 1(u_i \le d^0))\Big| > \kappa\delta/2\Big) \le C \exp(-C' n \delta^2)$$

for some constants $C$ and $C'$ that do not depend on $\alpha, \beta, t$ or the points $(u_1, u_2, \ldots, u_n)$. This implies that the second term can be made less than $\epsilon/3$ by choosing $n$ sufficiently large. It follows that for all sufficiently large $n$ (say $n > N_1 > N_0$) and an appropriate choice of $\delta$, we have

$$\sup_{(\alpha,\beta,t) \in R_n} \Pr_{\alpha,\beta,t}(|\hat d_{n_2} - d^0| > rn^{-\gamma}) \le C_0 (rn)^{-1} + 2\epsilon/3;$$

the first term on the right-hand side can be made less than $\epsilon/3$ by choosing $r = A/n$ where $A$ is large enough, showing that for all sufficiently large $n$, we can find $A$ large enough so that:

$$\sup_{(\alpha,\beta,t) \in R_n} \Pr_{\alpha,\beta,t}(n^{1+\gamma}|\hat d_{n_2} - d^0| > A) < \epsilon.$$

For the details as to how Corollary 8.8 of Van de Geer (2000) is applied in our setting, see the longer version of this proof in Lan, Banerjee and Michailidis (2007). □

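Proof of Lemma A.2. Let $\beta_n(s) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(U_i \le s)$. Let $\{s_k = \alpha + (\beta - \alpha) k 2^{-m} : 0 \le k \le 2^m\}$, $m \in \mathbb{N}$, be a dyadic partition of $[\alpha, \beta]$. Consider

$$P(\alpha, \beta, \lambda) \equiv P\Big(\sup_{\alpha \le s \le \beta} \frac{|\beta_n(s)|}{s} > \lambda\Big) = \lim_{m \to \infty} P\Big(\max_{1 \le k \le 2^m} \frac{|\beta_n(s_k)|}{s_k} > \lambda\Big)$$
$$= \lim_{m \to \infty} \int_{(x_1, \ldots, x_n) \in (0,1)^n} P\Big(\max_{1 \le k \le 2^m} \Big|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(x_i \le s_k)\Big| \times (s_k)^{-1} > \lambda\Big)\, dx_1\, dx_2 \cdots dx_n.$$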



For fixed $(x_1, x_2, \ldots, x_n) \in (0, 1)^n$, set $M_k = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(x_i \le s_k)$, $0 \le k \le 2^m$. Define $X_k = M_k - M_{k-1}$ for $k \ge 1$. Then the $X_k$'s are independent random variables, each with mean 0 and finite variance and $M_k = X_1 + X_2 + \cdots + X_k$. Since $1/s_k$ is a decreasing sequence of constants, we can apply the Hájek-Rényi inequality [see, e.g., the Appendix of Lan, Banerjee and Michailidis (2007)] to conclude that

$$P\Big(\max_{1 \le k \le 2^m} \frac{|M_k|}{s_k} > \lambda\Big) \le \frac{1}{\lambda^2} \sum_{1 \le k \le 2^m} s_k^{-2} (EM_k^2 - EM_{k-1}^2).$$



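Now,

$$EM_k^2 = E\Big[\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(x_i \le s_k)\Big)^2\Big] = \mathrm{Var}\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i 1(x_i \le s_k)\Big) = \frac{\sigma^2}{n} \sum_{i=1}^{n} 1(x_i \le s_k).$$

It follows that $EM_k^2 - EM_{k-1}^2 = \frac{\sigma^2}{n} \sum_{i=1}^{n} 1(s_{k-1} < x_i \le s_k)$. Therefore

$$P(\alpha, \beta, \lambda) \le \lim_{m \to \infty} \int_{(x_1, \ldots, x_n) \in (0,1)^n} \Big\{\frac{1}{\lambda^2} \sum_{1 \le k \le 2^m} s_k^{-2}\, \frac{\sigma^2}{n} \sum_{i=1}^{n} 1(s_{k-1} < x_i \le s_k)\Big\}\, dx_1 \cdots dx_n$$
$$= \lim_{m \to \infty} \frac{\sigma^2}{n\lambda^2} \sum_{1 \le k \le 2^m} s_k^{-2} \sum_{i=1}^{n} \int 1(s_{k-1} < x_i \le s_k)\, dx_i$$
$$= \lim_{m \to \infty} \frac{\sigma^2}{\lambda^2} \sum_{1 \le k \le 2^m} s_k^{-2} (s_k - s_{k-1})$$
$$= \frac{\sigma^2}{\lambda^2} \int_{\alpha}^{\beta} s^{-2}\, ds = \frac{\sigma^2 (\alpha^{-1} - \beta^{-1})}{\lambda^2}.$$

□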
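The last step of the proof replaces the dyadic sums $\sum_k s_k^{-2}(s_k - s_{k-1})$ by their limit $\int_\alpha^\beta s^{-2}\, ds = \alpha^{-1} - \beta^{-1}$. A minimal numerical sketch of that limit follows; the function name `dyadic_sum` and the values of $\alpha$ and $\beta$ are assumptions chosen for illustration.

```python
def dyadic_sum(alpha, beta, m):
    """Sum over k = 1..2^m of s_k^{-2} * (s_k - s_{k-1}),
    with s_k = alpha + (beta - alpha) * k / 2^m."""
    n_cells = 2 ** m
    step = (beta - alpha) / n_cells
    return sum(step / (alpha + k * step) ** 2 for k in range(1, n_cells + 1))

alpha, beta = 0.1, 0.9                 # illustrative endpoints
limit = 1.0 / alpha - 1.0 / beta       # value of the integral of s^{-2}
for m in (2, 6, 10, 14):
    print(m, dyadic_sum(alpha, beta, m), limit)
```

Because $s^{-2}$ is decreasing and each term uses the right endpoint $s_k$, these are lower Riemann sums, so they increase to the limit as the partition is refined.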
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